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Preface 


A Wasserstein distance is a metric between probability distributions u and v ona 
ground space 2’, induced by the problem of optimal mass transportation or simply 
optimal transport. It reflects the minimal effort that is required in order to reconfig- 
ure the mass of u to produce the mass distribution of v. The ‘effort’ corresponds 
to the total work needed to achieve this reconfiguration, where work equals the 
amount of mass at the origin times the distance to the prescribed destination of this 
mass. The distance between origin and destination can be raised to some power 
other than 1 when defining the notion of work, giving rise to correspondingly differ- 
ent Wasserstein distances. When viewing the space of probability measures on 2 
as a metric space endowed with a Wasserstein distance, we speak of a Wassertein 
Space. 

Mass transportation and the associated Wasserstein metrics/spaces are ubiqui- 
tous in mathematics, with a long history that has seen them catalyse core devel- 
opments in analysis, optimisation, and probability. Beyond their intrinsic mathe- 
matical richness, they possess attractive features that make them a versatile tool 
for the statistician. They frequently appear in the development of statistical the- 
ory and inferential methodology, sometimes as a technical tool in asymptotic the- 
ory, due to the useful topology they induce and their easy majorisation; and other 
times as a methodological tool, for example, in structural modelling and goodness- 
of-fit testing. A more recent trend in statistics is to consider Wasserstein spaces 
themselves as a sample and/or parameter space and treat inference problems in 
such spaces. It is this more recent trend that is the topic of this book and is 
coming to be known as ‘statistics in Wasserstein spaces’ or ‘statistical optimal 
transport’. 

From the theoretical point of view, statistics in Wasserstein spaces represents an 
emerging topic in mathematical statistics, situated at the interface between func- 
tional data analysis (where the data are functions, seen as random elements of an 
infinite-dimensional Hilbert space) and non-Euclidean statistics (where the data 
satisfy non-linear constraints, thus lying on non-Euclidean manifolds). Wasser- 
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stein spaces provide the natural mathematical formalism to describe data collec- 
tions that are best modelled as random measures on R? (e.g. images and point 
processes). Such random measures carry the infinite-dimensional traits of func- 
tional data, but are intrinsically non-linear due to positivity and integrability re- 
strictions. Indeed, contrarily to functional data, their dominating statistical variation 
arises through random (non-linear) deformations of an underlying template, rather 
than the additive (linear) perturbation of an underlying template. This shows op- 
timal transport to be a canonical framework for dealing with problems involving 
the so-called phase variation (also known as registration, multi-reference align- 
ment, or synchronisation problems). This connection is pursued in detail in this 
book and linked with the so-called problem of optimal multitransport (or optimal 
multicoupling). 


In writing our monograph, we had two aims in mind: 


1. To present the key aspects of optimal transportation and Wasserstein spaces 
(Chaps. 1 and 2) relevant to statistical inference, tailored to the interests and 
background of the (mathematical) statistician. There are, of course, classic texts 
comprehensively covering this background.! But their choice of topics and style 
of exposition are usually adapted to the analyst and/or probabilist, with aspects 
most relevant for statisticians scattered among (much) other material. 

2. To make use of the “Wasserstein background’ to present some of the funda- 
mentals of statistical estimation in Wasserstein spaces, and its connection to the 
problem of phase variation (registration) and optimal multicoupling. In doing 
so, we highlight connections with classical topics in statistical shape theory, 
such as Procrustes analysis. On these topics, no book/monograph appears to yet 
exist. 


The book focusses on the theory of statistics in Wasserstein spaces. It does 
not cover the associated computational/numerical aspects. This is partially due 
to space restrictions, but also due to the fact that a reference entirely dedicated 
to such issues can be found in the very recent monograph of Peyré and Cu- 
turi [103]. Moreover, since this book is meant to be a rapid introduction for 
non-specialists, we have made no attempt to give a complete bibliography. We 
have added some bibliographic remarks at the end of each chapter, but these are 
in no way meant to be exhaustive. For those seeking reference works, Rachev 
[106] is an excellent overview of optimal transport up to 1985. Other recent re- 
views are Bogachev and Kolesnikov [26] and Panaretos and Zemel [101]. The lat- 
ter review can be thought of as complementary to the present book and surveys 
some of the applications of optimal transport methods to statistics and probability 
theory. 


l E.g. by Rachev and Rüschendorf [107], Villani [124, 125], Ambrosio and Gigli [10], Ambrosio 
et al. [12], and more recently by Santambrogio [119]. 
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Structure of the Book 


The material is organised into five chapters. 


Chapter | presents the necessary background in optimal transportation. Starting 
with Monge’s original formulation, it presents Kantorovich’s probabilistic relax- 
ation and the associated duality theory. It then focusses on quadratic cost func- 
tions (squared normed cost) and gives a more detailed treatment of certain impor- 
tant special cases. Topics of statistical concern such as the regularity of transport 
maps and their stability under weak convergence of the origin/destination mea- 
sures are also presented. The chapter concludes with a consideration of more gen- 
eral cost functions and the characterisation of optimal transport plans via cyclical 
monotonicity. 

Chapter 2 presents the salient features of (f2-)Wasserstein space starting with 
topological properties of statistical importance, as well as metric properties such 
as covering numbers. It continues with geometrical features of the space, review- 
ing the tangent bundle structure of the space, the characterisation of geodesics, 
and the log and exponential maps as related to transport maps. Finally, it reviews 
the relationship between the curvature and the so-called compatibility of trans- 
port maps, roughly speaking when can one expect optimal transport maps to form 
a group. 

Chapter 3 starts to shift attention to issues more statistical and treats the prob- 
lem of existence, uniqueness, characterisation, and regularity of Fréchet means 
(barycenters) for collections of measures in Wasserstein space. This is done by 
means of the so-called multimarginal transport problem (a.k.a. optimal multi- 
transport or optimal multicoupling problem). The treatment starts with finite col- 
lections of measures, and then considers Fréchet means for (potentially uncount- 
ably supported) probability distributions on Wasserstein space and associated 
measurability concerns. 

Chapter 4 considers the problem of estimation of the Fréchet mean of a prob- 
ability distribution in Wasserstein space, on the basis of a finite collection of 
i.i.d. elements from this law observed with ‘sampling noise’. It is shown that 
this problem is inextricably linked to the problem of separation of amplitude and 
phase variation (a.k.a. registration) of random point patters, where the focus is on 
estimating the maps yielding the optimal multicoupling rather than the Fréchet 
mean itself. Nonparametric methodology for solving either problem is reviewed, 
coupled with associated asymptotic theory and several illustrative examples. 
Chapter 5 focusses on the problem of actually constructing the Fréchet mean 
and/or optimal multicoupling of a collection of measures, which is a necessary 
step when using the methods of Chap. 4 in practice. It presents the steepest de- 
scent algorithm based on the geometrical features reviewed in Chap. 2 and a con- 
vergence analysis thereof. Interestingly, it is seen that the algorithm is closely 
related to Procrustes algorithms in shape theory, and this connection is discussed 
in depth. Several special cases are reviewed in more detail. 
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Each chapter comes with some bibliographic notes at the end, giving some back- 
ground and suggesting further reading. The first two chapters can be used indepen- 
dently as a crash course in optimal transport for statisticians at the MSc or PhD level 
depending on the audience’s background. Proofs that were omitted from the main 
text due to space limitations have been organised into an online supplement and can 
be accessed from any online version chapter published on SpringerLink website: 
https://link.springer.com/book/10.1007/978-3-030-38438-8. 
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Chapter 1 | ® | 
Optimal Transport | 


Check for 
updates | 


In this preliminary chapter, we introduce the problem of optimal transport, which 
is the main concept behind Wasserstein spaces. General references on this topic are 
the books by Rachev and Rüschendorf [107], Villani [124, 125], Ambrosio et al. 
[12], Ambrosio and Gigli [10], and Santambrogio [119]. This chapter includes only 
few proofs, when they are simple, informative, or are not easily found in one of the 
cited references. 


1.1 The Monge and the Kantorovich Problems 


In 1781, Monge [95] asked the following question: given a pile of sand and a pit 
of equal volume, how can one optimally transport the sand into the pit? In modern 
mathematical terms, the problem can be formulated as follows. There is a sand space 
X , a pit space Y, and a cost function c : 2 x Y > R that encapsulates how 
costly it is to move a unit of sand at x € 2 to a location y € Y in the pit. The 
sand distribution is represented by a measure u on .2’, and the shape of the pit is 
described by a measure v on %. Our decision where to put each unit of sand can be 
thought of as a function T : 2° — Y, and it incurs a total transport cost of 


C(T) = [eT aul 


Moreover, one cannot put all the sand at a single point y in the pit; it is not allowed to 
shrink or expand the sand. The map T must be mass-preserving: for any subset B C 
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Y representing a region of the pit of volume v(B), exactly that same volume of sand 
must go into B. The amount of sand allocated to B is {x € 2 : T(x) € B} = T7! (B), 
so the mass preservation requirement is that u(T~!(B)) = v (B) for all B C Y. This 
condition will be denoted by T#u = v and in words: v is the push-forward of u un- 
der T, or T pushes u forward to v. To make the discussion mathematically rigorous, 
we must assume that c and T are measurable maps, and that u(T~!(B)) = v(B) for 
all measurable subsets of &. When the underlying measures are understood from 
the context, we call T a transport map. Specifying B = Y, we see that no such T 
can exist unless 1(.2°) = v(Y); we shall assume that this quantity is finite, and by 
means of normalisation, that and v are probability measures. In this setting, the 
Monge problem is to find the optimal transport map, that is, to solve 


inf C(T). 
T:T#u=v 

We assume throughout this book that 2° and Y are complete and separable met- 
ric spaces, | endowed with their Borel o-algebra, which, we recall, is defined as 
the smallest o-algebra containing the open sets. Measures defined on the Borel o- 
algebra of 2% are called Borel measures. Thus, if u is a Borel measure on 2’, then 
L(A) is defined for any A that is open, or closed, or a countable union of closed sets, 
etc., and any continuous map on 2 is measurable. Similarly, we endow Y with its 
Borel o-algebra. The product space 2 x Y is also complete and separable when 
endowed with its product topology; its Borel o-algebra is generated by the product 
o-algebra of those of 2 and %; thus, any continuous cost functionc: 2 x Y% +R 
is measurable. It will henceforth always be assumed, without explicit further notice, 
that u and v are Borel measures on 2 and Y, respectively, and that the cost func- 
tion is continuous and nonnegative. 

It is quite natural to assume that the cost is an increasing function of the dis- 
tance between x and y, such as a power function. More precisely, that Y% = 2 isa 
complete and separable metric space with metric d, and 


c(x,y)=d?(x,y), p20, x,ye X. (1.1) 


In particular, c is continuous, hence measurable, if p > 0. The limit case p = 0 yields 
the discontinuous function c(x,y) = 1{x = y}, which nevertheless remains measur- 
able because the diagonal {(x,x) : x € X } is measurable in 2 x X. Particular 
focus will be put on the quadratic case p = 2 (Sect. 1.6) and the linear case p = 1 
(Sect. 1.8.2). 

The problem introduced by Monge [95] is very difficult, mainly because the set 
of transport maps {T : T# = v} is intractable. And, it may very well be empty: this 
will be the case if u is a Dirac measure at some xp E€ 2 (meaning that (A) = 1 if 
xo € A and 0 otherwise) but v is not. Indeed, in that case the set B = {T (xo) } satisfies 
u(T~!(B)) = 1 > v(B), so no such T can exist. This also shows that the problem 
is asymmetric in u and v: in the Dirac example, there always exists a map T such 
that T#v = u—the constant map T (x) = xo for all x is the unique such map. A less 


' But see the bibliographical notes for some literature on more general spaces. 
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extreme situation occurs in the case of absolutely continuous measures. If u and v 
have densities f and g on R? and T is continuously differentiable, then T#u = v if 
and only if for u-almost all x 


F(x) = g(T(a))| det VT (x)|. 


This is a highly non-linear equation in T, nowadays known as a particular case of a 
family of partial differential equations called Monge—Ampeére equations. More than 
two centuries after the work of Monge, Caffarelli [32] cleverly used the theory of 
Monge—Ampeéere equations to show smoothness of transport maps (see Sect. 1.6.4). 

As mentioned above, if u = d{xo} is a Dirac measure and v is not, then no 
transport maps from u to v can exist, because the mass at xo must be sent to a unique 
point xo. In 1942, Kantorovich [77] proposed a relaxation of Monge’s problem in 
which mass can be split. In other words, for each point x € 2 one constructs a 
probability measure uy that describes how the mass at x is split among different 
destinations. If uy is a Dirac measure at some y, then all the mass at x is sent to y. 
The formal mathematical object to represent this idea is a probability measure 7 on 
the product space 2 x Y (which is 2? in our particular setting). Here 2(A x B) 
is the amount of sand transported from the subset A C 2 into the part of the pit 
represented by B C Y. The total mass sent from A is 7(A x 2), and the total mass 
sent into B is 1(2 x B). Thus, 7 is mass-preserving if and only if 


m(Ax Y) =u (A), ACK Borel; 


n(X x B) = v(B), BCY Borel. oe 


Probability measures satisfying (1.2) will be called transference plans, and the set of 
those will be denoted by T (u,v). We also say that is a coupling of u and v, and 
that u and v are the first and second marginal distributions, or simply marginals, of 
m. The total cost associated with a € IT(U, Vv) is 


C(x) = feb») dna. 


In our setting of a complete separable metric space 2’, one can represent 7 as a col- 
lection of probability measures {7x }xe g on Y, in the sense that for all measurable 
nonnegative g 


Jorat y)dn(x,y) =f E g(x,y) dax(y j| aua ). 


The collection {7x} is that of the conditional distributions, and the iteration of inte- 
grals is called disintegration. For proofs of existence of conditional distributions, 
one can consult Dudley [47, Section 10.2] or Kallenberg [76, Chapter 5]. Con- 
versely, the measure u and the collection {my} determine m uniquely by choosing 
g to be indicator functions. An interpretation of these notions in terms of random 
variables will be given in Sect. 1.2. 
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The Kantorovich problem is to find the best transference plan, that is, to solve 


inf C(z). 
meIT (u,v) 


The Kantorovich problem is a relaxation of the Monge problem, because to each 
transport map T one can associate a transference plan 7 = 7r of the same total cost. 
To see this, choose the conditional distribution 7, to be a Dirac at T(x). Disintegra- 
tion then yields 


ca)= f, eant) f, | ff esamo) a= f e Tout) 


This choice of 7 satisfies (1.2) because 2(A x B) = u(ANT~!(B)) and v(B) = 
u(T~!(B)) for all Borel A C 2 andBCY. 

Compared to the Monge problem, the relaxed problem has considerable advan- 
tages. Firstly, the set of transference plans is never empty: it always contains the 
product measure u & v defined by [u @ v](A) = u(A)v (B). Secondly, both the ob- 
jective function C(z) and the constraints (1.2) are linear in 7, so the problem can be 
seen as infinite-dimensional linear programming. To be precise, we need to endow 
the space of measures with a linear structure, and this is done in the standard way: 
define the space M(X ) of all finite signed Borel measures on X. This is a vector 
space with (u1 + &u2)(A) = U1 (A) + &u2(A) fora € R, u1, U2 EM( 2) and A C X 
Borel. The set of probability measures on 2 is denoted by P(X), and is a convex 
subset of M(X). The set TI (u,v) is then a convex subset of P(X xY), and as C(z) 
is linear in 7, the set of minimisers is a convex subset of TI (u,v). Thirdly, there is 
a natural symmetry between TI (u,v) and H(v, u). If a belongs to the former and 
we define (B x A) = 7(A x B), then # € IT(v, u). If we set č(y,x) = c(x,y), then 


ca)= f œde) =f) ely.x) dit(y,2) = ČA). 


In particular, when 2 = Y and c = č is symmetric (as in (1.1)), 


inf C(x)=_ inf Č(ř), 
neN(u,v) ãeENM(v,u) 


and m € IT(u,v) is optimal if and only if its natural counterpart is optimal in 
TI (v, u). This symmetry will be fundamental in the definition of the Wasserstein 
distances in Chap. 2. 

Perhaps most importantly, a minimiser for the Kantorovich problem exists under 
weak conditions. In order to show this, we first recall some definitions. Let C,(.2") 
be the space of real-valued, continuous bounded functions on 2. A sequence of 
probability measures {4n} E€ M(X) is said to converge weakly? to y E€ M(X) 
if for all f € (X), ffdu, —> f fdu. To avoid confusion with other types of 
convergence, we will usually write Un — u weakly; in the rare cases where a symbol 


2 Weak convergence is sometimes called narrow convergence, weak* convergence, or convergence 
in distribution. 
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is needed we shall use the notation U, “> u. Of course, if U, —> u weakly and 
Un E P(X), then u must be in P(X) too (this is seen by taking f = 1 and by 
observing that f fdu > Oif f > 0). 

A collection of probability measures .% is tight if for all € > 0 there exists a com- 
pact set K such that infy<.y U(K) > 1—e. If X is represented by a sequence {Hn }, 
then Prokhorov’s theorem (Billingsley [24, Theorem 5.1]) states that a subsequence 
of {un} must converge weakly to some probability measure LU. 

We are now ready to show that the Kantorovich problem admits a solution when c 
is continuous and nonnegative and .2 and Y are complete separable metric spaces. 
Let {7} be a minimising sequence for C. Then, according to [24, Theorem 1.3], 
U and v must be tight. If Kı and Kz are compact with u(K)),v(K2) > 1— €, then 
K x Ky is compact and for all 2 € TI (u,v), (Ki x K2) > 1—2e. It follows that the 
entire collection T (u,v) is tight, and by Prokhorov’s theorem 7, has a weak limit 
nm after extraction of a subsequence. For any integer K, cx(x,y) = min(c(x,y), K) is 
a continuous bounded function, and 


C(m) = felx.y)dmi(xy) > fex(sy)dmley) > fered), n>a. 
By the monotone convergence theorem 


liminfC (7) > jim fex) dz(x,y) = C(7) if m, > m weakly. (1.3) 
noo —poo 
Since {7} was chosen as a minimising sequence for C, 2 must be a minimiser, and 
existence is established. 

As we have seen, the Kantorovich problem is a relaxation of the Monge problem, 
in the sense that 


inf C(T)= inf C(a)> inf C(x) =C(x* 
„if C0) = inf Ca) > inf | C(m) =C(a"), 


for some optimal 2*. If 2* = mr for some transport map T, then we say that the 
solution is induced from a transport map. This will happen in two different and 
important cases that are discussed in Sects. 1.3 and 1.6.1. 

A remark about terminology is in order. Many authors talk about the Monge— 
Kantorovich problem or the optimal transport(ation) problem. More often than not, 
they refer to what we call here the Kantorovich problem. When one of the scenarios 
presented in Sects. 1.3 and 1.6.1 is considered, this does not result in ambiguity. 


1.2 Probabilistic Interpretation 


The preceding section was an analytic presentation of the Monge and the Kan- 
torovich problems. It is illuminating, however, to also recast things in probabilistic 
terms, and this is the topic of this section. 
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A random element on a complete separable metric space (or any topological 
space) 2 is simply a measurable function X from some (generic) probability space 
(Q,F,P) to X (with its Borel o-algebra). The probability law (or probability dis- 
tribution) is the probability measure x = X#P defined on the space .2’; this is the 
Borel measure satisfying Ux(A) = P(X € A) for all Borel sets A. 

Suppose that one is given two random elements X and Y taking values in 2 and 
Y, respectively, and a cost function c: 2 x Y — R. The Monge problem is to find 
a measurable function T such that T(X) has the same distribution as Y, and such 
that the expectation 


c(r) = f cT u(x) = f elX(@),7(X())]4P(@) = Ee, TX) 


is minimised. 

The Kantorovich problem is to find a joint distribution for the pair (X,Y) whose 
marginals are the original distributions of X and Y, respectively, and such that the 
probability law 2 = (X,Y )#P minimises the expectation 


C(x) = J, „Ei = | e(X(a),¥(@))]4P(@) = Exe(X,Y). 


Any such joint distribution is called a coupling of X and Y. Of course, (X,7T(X)) is a 
coupling when T (X) has the same distribution as Y. The measures 7, in the previous 
section are then interpreted as the conditional distribution of Y given X = x. 
Consider now the important case where X = Y = Rf, c(x,y) = ||x— y||?, and 
X and Y are square integrable random vectors (E||X ||? + E||Y||? < ~). Let A and B 
be the covariance matrices of X and Y, respectively, and notice that the covariance 
AV 
V' B 


matrix of a coupling 7 must have the form C = ) for ad x d matrix V. The 


covariance matrix of the difference X — Y is 


AV\ (I 
(Ia —I) o :) eG =A+B-V'-V 


so that 


ine(X,Y) =Eql|X —¥||? = [EX -EY |? +tr[4 +8- V' — V]. 


Since only V depends on the coupling 7, the problem is equivalent to that of max- 
imising the trace of V, the cross-covariance matrix between X and Y. This must be 
done subject to the constraint that a coupling 7 with covariance matrix C exists; in 
particular, C has to be positive semidefinite. 


1.3 The Discrete Uniform Case T 


1.3 The Discrete Uniform Case 


There is a special case in which the Monge—Kantorovich problem reduces to a finite 
combinatorial problem. Although it may seem at first hand as an oversimplification 
of the original problem, it is of importance in practice because arbitrary measures 
can be approximated by discrete measures by means of the strong law of large num- 
bers. Moreover, the discrete case is important in theory as well, as a motivating 
example for the Kantorovich duality (Sect. 1.4) and the property of cyclical mono- 
tonicity (Sect. 1.7). 
Suppose that u and v are each uniform on n distinct points: 


u=}, v= (öp): 


The only relevant costs are c;j = c(x;, yj), the collection of which can be represented 
by an n x n matrix C. Transport maps T are associated with permutations in Sn, the 
set of all bijective functions from {1,...,} to itself: given o € S,, a transport map 
can be constructed by defining T (x;) = Yo(i): If © is not a permutation, then T will 
not be a transport map from u to v. Transference plans 7 are equivalent to n x n 
matrices M with coordinates Mj; = m({(xi,y;)}) = Mij; this is the amount of mass 
sent from x; to yj. In order for 7 to a be a transference plan, it must be that X}, jMij = 
1/n for all i and };Mj; = 1/n for all j, and in addition M must be nonnegative. In 
other words, the matrix M’ = nM belongs to B,, the set of bistochastic matrices of 
order n, defined as n x n matrices M’ satisfying 


n us 
2 Mj=l, i=l, È Mj=1l, j=1,.3m; My20. 
= i=l 


The Monge problem is the combinatorial optimisation problem over permutations 


n 


1 
inf C(o) = -= inf Vico, 


OESy N O€Sp par 


and the Kantorovich problem is the linear program 


n n 
eoa 2 cijMij 7 wet » cu E jo 
If o is a permutation, then one can define M = M(o) by Mij = 1/n if j = o (i) and 
0 otherwise. Then M € B,/n and C(M) = C(o). Such M (or, more precisely, nM) is 
called a permutation matrix. 

The Kantorovich problem is a linear program with n? variables and 2n con- 
straints. It must have a solution because B, (hence B,,/n) is a compact (nonempty) 
set in R”? and the objective function is linear in the matrix elements, hence contin- 
uous. (This property is independent of the possibly infinite-dimensional spaces 2 
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and Y in which the points lie.) The Monge problem also admits a solution because 
Sn is a finite set. To see that the two problems are essentially the same, we need to 
introduce the following notion. If B is a convex set, then x € B is an extremal point 
of B if it cannot be written as a convex combination tz + (1 — t)y for some distinct 
points y,z € B. It is well known (Luenberger and Ye [89, Section 2.5]) that there 
exists an optimal solution that is extremal, so that it becomes relevant to identify 
the extremal points of B,. It is fairly clear that each permutation matrix is extremal 
in B,; the less obvious converse is known as Birkhoff’s theorem, a proof of which 
can be found, for instance, at the end of the introduction in Villani [124] or (in a 
different terminology) in Luenberger and Ye [89, Section 6.5]. Thus, we have: 


Proposition 1.3.1 (Solution of Discrete Problem) There exists © € S, such that 
M(o) minimises C(M) over B,/n. Furthermore, if {0),...,0,} is the set of opti- 
mal permutations, then the set of optimal matrices is the convex hull of {M(01),..., 
M(ox)}. In particular, if © is the unique optimal permutation, then M(o) is the 
unique optimal matrix. 


Thus, in the discrete case, the Monge and the Kantorovich problems coincide. One 
can of course use the simplex method [89, Chapter 3] to solve the linear program, 
but there are n! vertices, and there is in principle no guarantee that the simplex 
method solves the problem efficiently. However, the constraints matrix has a very 
specific form (it contains only zeroes and ones, and is totally unimodular), so spe- 
cialised algorithms for this problem exist. One of them is the Hungarian algorithm 
of Kuhn [85] or its variant of Munkres [96] that has a worst-case computational 
complexity of at most O(n*). Another alternative is the class of net flow algorithms 
described in [89, Chapter 6]. In particular, the algorithm of Edmonds and Karp [50] 
has a complexity of at most O(n? logn). This monograph does not focus on compu- 
tational aspects for optimal transport. This is a fascinating and very active area of 
contemporary research, and readers are directed to Peyré and Cuturi [103]. 


Remark 1.3.2 The special case described here could have been more precisely 
called “the discrete uniform case on the same number of points”, as “the discrete 
case” could refer to any two finitely supported measures u and v. In the Monge 
context, the setup discussed here is the most interesting case, see page 8 in the sup- 
plement for more details. 


1.4 Kantorovich Duality 


The discrete case of Sect. 1.3 is an example of a linear program and thus enjoys a 
rich duality theory (Luenberger and Ye [89, Chapter 4]). The general Kantorovich 
problem is an infinite-dimensional linear program, and under mild assumptions ad- 
mits similar duality. 
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1.4.1 Duality in the Discrete Uniform Case 


; . 2 ‘ 
We can represent any matrix M as a vector in R” , say M, by enumeration of the 
elements row by row. If nM is bistochastic, i.e., M € B,/n, then the 2n constraints 
can be represented in a (2n) x n? matrix A. For instance, if n = 3, then 


111 


For general n, the constraints read AM = n7!(1,...,1) € R?” and A takes the form 


1, 
A= 7 eR 14, =(1,..., 1) ER", 


with J, the n x n identity matrix. Thus, the problem can be written 
1 
min C'M subjectto AM=-(l,...,1)€R”; M>0. 
n 


The last constraint is to be interpreted coordinate-wise; all the elements of M must 
be nonnegative. The dual problem is constructed by introducing one variable for 
each row of A, transposing the constraint matrix and interchanging the roles of the 
objective vector C and the constraints vector b = n~!(1,...,1). Call the new vari- 
ables pj,...,Pn and q1,...,gn, and notice that each column of A corresponds to 
exactly one p; and one q;, and that the n? columns exhaust all possibilities. Hence, 
the dual problem is 


n 1 n 
max vC)=i$a+i day subject to pitqj<cij, i,j=1,...,n. 
i=l j=l 


p.qeR" 
(1.4) 
In the context of duality, one uses the terminology primal problem for the original 
optimisation problem. Weak duality states that if M and (p,q) satisfy the respective 
constraints, then 


l l 
b @ =} piz + Lai =È (Pi+q;)Mij SY CMij = CM. 
f J J 


ij 
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In particular, if equality holds, then M is primal optimal and (p,q) is dual optimal. 
Strong duality is the nontrivial assertion that there exist M* and (p*,q*) satisfying 
C’M* = bi (2). 


1.4.2 Duality in the General Case 


The vectors C and M were obtained from the cost function c and the transference 
plan 7 as C;j =c(xi,y;) and Mj; = n ({ (xi, y;) }). Similarly, we can view the vectors p 
and q as restrictions of functions g: 2 — Rand y : Y > R of the form p; = (xi) 
and q; = w(y;). The constraint vector b = (1,,1,) can be written as b; = u ({x;}) and 
bn; = V({y;}). In this formulation, the constraint p; +q; < cij writes (Q, y) € ®e 
with 


®. = {(9,W) € Li (u) x Li(v) : p(x) + wy) < c(x,y) for all x,y}, 


and the dual problem (1.4) becomes 


sup i (x) du(x) + f wy) avy) subjectto (p, y) € ®©.. 
(Wiel (u) xL (v) LY y 


Simple measure theory shows that the set constraints (1.2) defining the transference 
plans set T (u,v) are equivalent to functional constraints. For future reference, we 
state this formally as: 


Lemma 1.4.1 (Functional Constraints) Let u and v be probability measures. Then 
m € II(u,Vv) if and only if for all integrable functions ọ € Lı (u), y € Ly (v), 


J, Otvor) =f odu f wo)avo. 


The proof follows from the fact that (1.2) yields the above equality when ọ and y 
are indicator functions. One then uses linearity and approximations to deduce the 
result. 

Weak duality follows immediately from Lemma 1.4.1. For if 7 € H(u,v) and 
(p, y) € ®,, then 


f oouo f vovavo) = fo) +yz) < C(x). 


Strong duality can be stated in the following form: 


Theorem 1.4.2 (Kantorovich Duality) Let and v be probability measures on 
complete separable metric spaces & and Y, respectively, and let c: X x Y > R+, 
be a measurable function. Then 


1.5 The One-Dimensional Case 11 


inf i cdz= sup f eau f way]. 
mel] (Uv) I ZXY 


(p, W)E®- 


See the Bibliographical Notes for other versions of the duality. 

When the cost function is continuous, or more generally, a countable supremum 
of continuous functions, the infimum is attained (see (1.3)). The existence of max- 
imisers (@, y) is more delicate and requires a finiteness condition, as formulated in 
Proposition 1.8.1 below. 

The next sections are dedicated to more concrete examples that will be used 
through the rest of the book. 


1.5 The One-Dimensional Case 


When 2 = Y = R, the Monge—Kantorovich problem has a particularly simple 
structure, because the class of “nice” transport maps contains at most a single el- 
ement. Identify u,v € P(R) with their cumulative distribution functions F and G 
defined by 


F()=u((-=,1),  G)=v((-~,4)), eR. 


Let the cost function be (momentarily) quadratic: c(x,y) = |x — y|?/2. Since for 
x1 < X2, Y1 Sy2 


c(y2,x1) + c(y1,x2) — c(y1,x1) — c(y2,*2) = (x2 — x1)(y2 — y1) = 0, 


it seems natural to expect the optimal transport map to be monotonically increasing. 
It turns out that, on the real line, there is at most one such transport map: if T is 
increasing and T#u = v, then for allt € R 


G(t) = v((—%,t]) = u((=, T7! (£)]) = F(T"). 


If £ = T(x), then the above equation reduces to T(x) = G7! (F (x)). This formula 
determines T uniquely, and has an interesting probabilistic interpretation: it is well- 
known that if X is a random variable with continuous distribution function F, then 
F(X) follows a uniform distribution on (0,1). Conversely, if U follows a uniform 
distribution, G is any distribution function, and 


G`! (u) = inf G7! ([u,1]) = inf{x E€ R : G(x) >u}, O<uK<l, 


is the quantile function of X, then the random variable G7! (U) has distribution func- 
tion G. We say that G7! is the left-continuous inverse of G. In terms of push-forward 
maps, we can write F#u = Leb|jp,ı] and G~!#Leb|/9,1 = v, with Leb standing for 
Lebesgue measure, and it is restricted to the interval [0,1]. Consequently, if F is 
continuous and G is arbitrary, then T#u = v; we can view T as pushing u forward 
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to v in two steps: firstly, u is pushed forward to Leb|io,1 and secondly, Leb|io,1 is 
pushed forward to v. 
Using the change of variables formula, we see that the total cost of T is 


T)= f IFG) -xP du) =f |G} (u) — F~! (u))|? du. 


If F is discontinuous, then F#u is not Lebesgue measure, and T is not necessar- 
ily defined. But there will exist an optimal transference plan a € H (u,v) that is 
monotone in the following sense: there exists a set lT C R? such that 2(I”) = 1 and 
whenever (x;,y;) ET, 


yo — 21? + [y1 = x2 — Iyı — x11? — [v2 — x2? > 0. 


Thus, mass at x; and x2 can be split if need be, but in a monotone way. For example, 
if u puts mass 1/2 at x; = —1 and at x2 = 1 and v is uniform on [—1, 1]. Then the 
transference plan spreads the mass of x; uniformly on [—1,0], and the mass of x2 
uniformly on [0, 1]. This is a particular case of the cyclical monotonicity that will be 
discussed in Sect. 1.7. 

Elementary calculations show that the inequality 


c(y2,%1) + e(91,%2) — (91,41) — C(y2,x2) > 0, Xp lx; ysy 


holds more generally than the quadratic cost c(x, y) = |x— y|?. Specifically, it suffices 
that c(x,y) = A(|x— y|) with A convex on R+. 

Since any distribution can be approximated by continuous distributions, in view 
of the above discussion, the following result from Villani [124, Theorem 2.18] 
should not be too surprising. 


Theorem 1.5.1 (Optimal Transport in R) Let u,v € P(R) with distribution func- 
tions F and G, respectively, and let the cost function be of the form c(x,y) = 
h(|x— y|) with h convex and nonnegative. Then 


inf C(x j= [ine h(G tajd 


zelM(u,v) 


If the infimum is finite and h is strictly convex, then the optimal transference plan is 
unique. Furthermore, if F is continuous, then the infimum is attained by the trans- 
port map T = G`! 


The prototypical choice for h is A(z) = |z|? with p > 1. This result allows in particu- 
lar a direct evaluation of the Wasserstein distances for measures on the real line (see 
Chap. 2). 

Note that no regularity is needed in order that the optimal transference plan be 
unique, unlike in higher dimensions (compare Theorem 1.8.2). The structure of so- 
lutions in the concave case (0 < p < 1) is more complicated, see McCann [94]. 
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When p = 1, the cost function is convex but not strictly so, and solutions will not 
be unique. However, the total cost in Theorem 1.5.1 admits another representation 
that is often more convenient. 


Proposition 1.5.2 (Quantiles and Distribution Functions) Zf F and G are distri- 
bution functions, then 


1 
f ew- rwu [ee - 


The proof is a simple application of Fubini’s theorem; see page 13 in the supple- 
ment. 


Corollary 1.5.3 If c(x,y) = 


inf C(a )= [16¢)- 
melT (u,v) 


1.6 Quadratic Cost 


This section is devoted to the specific cost function 


_ yyy2 
c(x,y) = Lee x,y E X, 
where 2 is a separable Hilbert space. This cost is popular in applications, and 
leads to a lucid and elegant theory. The factor of 1/2 does not affect the minimising 
coupling 7 and leads to cleaner expressions. (It does affect the optimal dual pair, but 
in an obvious way.) 


1.6.1 The Absolutely Continuous Case 


We begin with the Euclidean case, where 2 = = (Rf, ||- ||) is endowed with 
the Euclidean metric, and use the Kantorovich duality to obtain characterisations of 
optimal maps. 

Since the dual objective function to be maximised 


[eau f wav 


is increasing in @ and y, one should seek functions that take values as large as pos- 
sible subject to the constraint p(x) + w(y) < ||x—y||?/2. Suppose that an oracle tells 
us that some @ € Lı (u) is a good candidate. Then the largest possible y satisfying 


(p, y) E€ ® is 
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ar (Ble II? sng [P 
vo) = ing [PE - o0] = BE- int [AE (0) — (x99). 


In other words, 


x 2 
-y0)= sup [æy 8], p) = BE 


xeR4 2 


2 
go) = PE 


= (2). 
As a supremum over affine functions (in y), W enjoys some useful properties. We 
remind the reader that a function f : 2 —> RU {æ} is convex if f(tx+(1—t)y) < 
tf(x)+(1—t)f(y) for all x,y € X and t € [0,1]. It is lower semicontinuous if for 
allx€ Z, f(x) < liminf,_,, f(y). Affine functions are convex and lower semicon- 
tinuous, and it straightforward from the definitions that both convexity and lower 
semicontinuity are preserved under the supremum operation. Thus, the function y 
is convex and lower semicontinuous. In particular, it is Borel measurable due to the 
following characterisation: f is lower semicontinuous if and only if {x : f(x) < a} 
is a closed set for all œ € R. 

From the preceding subsection, we now know that optimal dual functions @ and 
y must take the form of the difference between ||- ||2/2 and a convex function. 
Given the vast wealth of knowledge | on convex functions (Rockafellar [113], it will 
be convenient to work with @ and y, and to assume that Y = (@)*, where 


f*(y) = sup [(xy)—fQ)], yER? 


xeR¢ 
is the Legendre transform of f ({113, Chapter 26]; [124, Chapter 2]), and is of 
fundamental importance in convex analysis. Now by symmetry, one can also replace 
Ọ by (y)* = (@)**, so it is reasonable to expect that an optimal dual pair should take 


the form (|| - ||7/2— @, || -17/2 — (@)*), with @ convex and lower semicontinuous. 
The alternative representation of the dual objective value as 


[eau [wav = z fall? du(x +5 [ lblÈavo) - [ pan- [wav 


is valid under the integrability condition 


[,lePaue + [lave < 


that u and v have finite second moments. This condition also guarantees that an 
optimal @ exists, as the conditions of Proposition 1.8.1 are satisfied. An alternative 
direct proof for the quadratic case can be found in Villani [124, Theorem 2.9]. 

Suppose that an optimal @ is found. What can we say about optimal transference 
plans 7? According to the duality, a necessary and sufficient condition is that 


2 
Lo H an dr(x,y) = [edu f way, 
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where y = || - ||7/2— (|| ||?/2 —@)*. Equivalently (using Lemma 1.4.1), 
J IG) + YO) = a)r) =0. (1.5) 
R¢xR¢ 


Hence, the integral vanishes if and only if z is concentrated on the set of (x,y) such 
that (x) + @*(y) = (x,y). By definition of the Legendre transform as a supremum, 
this happens if and only if the supremum defining @* (y) is attained at x; equivalently 


Since we have (x) + (@)*(y) > (x,y) everywhere, the integrand is nonnegative. 


G(z)- G(x) >e—xy), ZEX. 


This condition is precisely the definition of y being a subgradient of @ at x [113, 
Chapter 23]. When @ is differentiable at x, its unique subgradient is the gradient 
y = V@(x) [113, Theorem 25.1]. If we are fortunate and @ is differentiable every- 
where, or even -almost everywhere, then the optimal transference plan 7 is unique, 
and in fact induced from the transport map V@. The problem, of course, is that @ 
may fail to be differentiable u-almost surely. This is remedied by assuming some 
regularity on the source measure u in order to make sure that any convex func- 
tion be differentiable u-almost surely, and is done via the following regularity re- 
sult, which, roughly speaking, states that convex functions are differentiable almost 
surely. A stronger version is given in Rockafellar [113, Theorem 2.25], with an al- 
ternative proof in Alberti and Ambrosio [6, Chapter 2]. One could also combine the 
local Lipschitz property of convex functions [113, Chapter 10] with Rademacher’s 
theorem (Villani [125, Theorem 10.8]). 


Theorem 1.6.1 (Differentiability of Convex Functions) Let f : Rf + RU {œ} be 
a convex function with domain domf = {x € R! : f(x) < œ} and let N be the set 
of points at which f is not differentiable. Then N Ndomf has Lebesgue measure 0. 


Theorem 1.6.1 is usually stated for the interior of domf, denoted int(dom/), rather 
than the closure. But, since A = dom/ is convex, its boundary has Lebesgue measure 
zero. To see this assume first that A is bounded. If intA is empty, then A lies in a lower 
dimensional subspace [113, Theorem 2.4]. Otherwise, without loss of generality 
0 € intA, and then by convexity of A, dA C (1+ €£)A for all € > 0. When A is 
unbounded, write it as U„A N [—n,n]?. 

Another issue that might arise is that optimal ~’s might not exist. This is easily 
dealt with using Proposition 1.8.1. If we assume that and v have finite second 
moments: 


[ilxPaw) <2 ana IyIPavy <, 


then any transference plan 2 € IT(1, v) has a finite cost, as is seen from integrating 
the elementary inequality ||x— y||? < 2|]x||? + 2||y||? and using Lemma 1.4.1: 


cas f lke) f de f lavo) <=. 
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With these tools, we can now prove a fundamental existence and uniqueness result 
for the Monge—Kantorovich problem. It has been proven independently by several 
authors, including Brenier [31], Cuesta-Albertos and Matran [37], Knott and Smith 
[83], and Rachev and Riischendorf [117]. 


Theorem 1.6.2 (Quadratic Cost in Euclidean Spaces) Let u and v be probability 
measures on R¢ with finite second moments, and suppose that u is absolutely contin- 
uous with respect to Lebesgue measure. Then the solution to the Kantorovich prob- 
lem is unique, and is induced from a transport map T that equals u-almost surely the 
gradient of a convex function . Furthermore, the pair (||x||?/2— ¢, ||y||*/2 — 0*) 
is optimal for the dual problem. 


Proof. To alleviate the notation we write @ instead of @. By Proposition 1.8.1, there 
exists an optimal dual pair (~, y) such that @(x) = ||x||?/2 — @(x) is convex and 
lower semicontinuous, and by the discussion in Sect. 1.1, there exists an optimal m. 
Since @ is u-integrable, it must be finite almost everywhere, i.e., u(dom@) = 1. By 
Theorem 1.6.1, if we define .% as the set of nondifferentiability points of @, then 
Leb(.VW Mdom@) = 0; as u is absolutely continuous, the same holds for u. (Here 
Leb denotes Lebesgue measure.) 

We conclude that u (int(dom®@) \ 4) = 1. In other words, @ is differentiable u- 
almost everywhere, and so for u-almost any x, there exists a unique y such that 
o(x) + 6*(y) = (x,y), and y = Vọ (x). This shows that 7 is unique and induced 
from the transport map V@(x). The gradient V@ is Borel measurable, since each 
of its coordinates can be written as limsup,_.9 ,<@ gq '(¢(x+ qv) — 6(x)) for some 
vector v (the canonical basis of R“), which is Borel measurable because the limit 
superior is taken on countably many functions (and @ is measurable because it is 
lower semicontinuous). 


1.6.2 Separable Hilbert Spaces 


The finite-dimensionality of R? in the previous subsection was only used in order to 
apply Theorem 1.6.1, so one could hope to extend the results to infinite-dimensional 
separable Hilbert spaces. 

Although there is no obvious parallel for Lebesgue measure (i.e., translation in- 
variant) on infinite-dimensional Banach spaces, one can still define absolute con- 
tinuity via Gaussian measures. Indeed, u € P(IR®) is absolutely continuous with 
respect to Lebesgue measure if and only if the following holds: if “” C R? is such 
that v( 4) = 0 for any nondegenerate Gaussian measure v, then 1(.4”) = 0. This 
definition can be extended to any separable Banach space 2% via projections, as 
follows. Let 2™ be the (topological) dual of 2”, consisting of all real-valued, con- 
tinuous linear functionals on 2”. 


Definition 1.6.3 (Gaussian Measures) A probability measure u € P(X) is a non- 
degenerate Gaussian measure if for any L E€ 2*\ 40}, Gu € P(R) is a Gaussian 
measure with positive variance. 
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Definition 1.6.4 (Gaussian Null Sets and Absolutely Continuous Measures) A 
subset N C X is a Gaussian null set if whenever v is a nondegenerate Gaussian 
measure, V(.V) = 0. A probability measure u € P(X) is absolutely continuous if 
u vanishes on all Gaussian null sets. 


Clearly, if v is a nondegenerate Gaussian measure, then it is absolutely continuous. 

As explained in Ambrosio et al. [12, Section 6.2], a version of Rademacher’s 
theorem holds in separable Hilbert spaces: a locally Lipschitz function is Gateaux 
differentiable except on a Gaussian null set of 2°. Theorem 1.6.2 (and more gener- 
ally, Theorem 1.8.2) extend to infinite dimensions; see [12, Theorem 6.2.10]. 


1.6.3 The Gaussian Case 


Apart from the one-dimensional case of Sect. 1.5, there is another special case in 
which there is a unique and explicit solution to the Monge—Kantorovich problem. 

Suppose that u and v are Gaussian measures on R? with zero means and nonsin- 
gular covariance matrices A and B. By Theorem 1.6.2, we know that there exists a 
unique optimal map T such that T#u = v. Since linear push-forwards of Gaussians 
are Gaussian, it seems natural to guess that T should be linear, and this is indeed the 
case. 

Since T is a linear map that should be the gradient of a convex function @, it must 
be that ¢ is quadratic, i.e., @(x) — ¢ (0) = (x, Qx) for x € R? and some matrix Q. The 
gradient of @ at x is (Q + Q')x and the Hessian matrix is Q + Q*. Thus, T = Q + Q 
and since @ is convex, T must be positive semidefinite. 

Viewing T as a matrix leads to the Riccati equation TAT = B (since T is sym- 
metric). This is a quadratic equation in 7, and so we wish to take square roots in a 
way K would isolate T. This is done by multiplying the equation from both sides 
with Al/?: 


(Ara JA! 2TA"?] = Al TATĄ!/2 — A!/2pal/2 = Ree stele (swede? wie 


All matrices in brackets are positive semidefinite. By taking square roots and multi- 
plying with A7!/2, we finally find 


T = AT! JA BA"? PAT". 


A straightforward calculation shows that TAT = B indeed, and T is positive definite, 
hence optimal. To calculate the transport cost C(T), observe that (T —/)#u is a 
centred Gaussian measure with covariance matrix 


TAT — TA —AT +A =A+B—A!/?[A!/2BA?]1/24-1/2_ A71? (Al /2BAl/2]1/241/2. 


IfY ~.¥(0,C), then E||Y ||? equals the trace of C, denoted trC. Hence, by properties 
of the trace, 
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C(T) =tr [A+B —2(a"/?Ba"/2)1/2) (1.6) 


By continuity arguments, (1.6) is the total transport cost between any two Gaussian 
distributions with zero means, even if A is singular. 
If AB = BA, the above formulae simplify to 


T=B'2A7!2,  C(T) =u [A +B- 24"/2B"?] = |Al/2 — Bl/2)/2, 


with F the Frobenius norm. 
If the means of u and v are m and n, one simply needs to translate the measures. 
The optimal map and the total cost are then 


Tx=n+A—"V/2[al2paV2/2 4-1/2 (x m); 


C(T) = ||n — m||? + tr[A + B—2(A'/?BAl/?) 1/2), 


From this, we can deduce a lower bound on the total cost between any two mea- 
sures in Rf in terms of their second order structure. This is worth mentioning, be- 
cause such lower bounds are not very common (the Monge—Kantorovich problem 
is defined by an infimum, and thus typically easier to bound from above). 


Proposition 1.6.5 (Lower Bound for Quadratic Cost) Let u,v € P(R?) have 
means m and n and covariance matrices A and B and let T be the optimal map. 
Then 

C(x) > |n- ml? + tA + B- 2(A/?BAl?)!/2), 


Proof. It will be convenient here to use the probabilistic terminology of Sect. 1.2. 
Let X and Y be random variables with distributions u and v. Any coupling of X and 


') € R*4*4 for some matrix 


V € R@“@, constrained so that C is positive semidefinite. This gives the lower bound 


Y will have covariance matrix of the form C = ( 


inf E,\|X—Y||?=||m—n|?+ inf trg[A+B—2V] > ||m—nl|*+_ inf tr[A+B—2V]. 
melT (u,v) melT (u,v) V:C>0 


As we know from the Gaussian case, the last infimum is given by (1.6). 


1.6.4 Regularity of the Transport Maps 


The optimal transport map T between Gaussian measures on Rf is linear, so it is 
of course very smooth (analytic). The densities of Gaussian measures are analytic 
too, so that T inherits the regularity of u and v. Using the formula for T, one can 
show that a similar phenomenon takes place in the one-dimensional case. Though 
we do not have a formula for T at our disposal when u and v are general absolutely 
continuous measures on Rf, d > 2, it turns out that even in that case, T inherits the 
regularity of u and v if some convexity conditions are satisfied. 
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To guess what kind of results can be hoped for, let us first examine the case d = 1. 
Let F and G denote the distribution functions of and v, respectively. Suppose 
that G is continuously differentiable and that G’ > 0 on some open interval (finite 
or not) J such that v(/) = 1. Then the inverse function theorem says that G7! is 
also continuously differentiable. Recall that the support of a (Borel) probability 
measure u (denoted supp) is the smallest closed set K such that u(K) = 1. A 
simple application of the chain rule (see page 19 in the supplement) gives: 


Theorem 1.6.6 (Regularity in R) Let u,v € P(R) possess distribution functions 
F and G of class C*, k > 1. Suppose further that suppv is an interval I (possibly 
unbounded) and that G' > 0 on the interior of I. Then the optimal map is of class C 
as well. If F,G € C? are merely continuous, then so is the optimal map. 


The assumption on the support of v is important: if u is Lebesgue measure on (0, 1] 
and the support of v is disconnected, then T cannot even be continuous, no matter 
how smooth v is. 

The argument above cannot be easily extended to measures on R¢, d > 2, because 
there is no explicit formula available for the optimal maps. As before, we cannot 
expect the optimal map to be continuous if the support of v is disconnected. It turns 
out that the condition on the support of v is not connectedness, but rather convexity. 
This was shown by Caffarelli, who was able to prove ([32] and the references within) 
that the optimal maps have the same smoothness as the measures. To state the result, 
we recall the following notation for an open Q C Rf, k > 0 and æ € (0, 1]. We say 
that f € CH% (Q) if all the partial derivatives of order k of f are locally œ-Hölder 
on Q. For example, if k = 1, this means that for any x € Q there exists a constant L 
and an open ball B containing x such that 


IVf) -YO < Lly- zl”, y,zE€B. 


Note that f € C+! —> f e CHB — f e Œt — fe œ, fr0<a <B <1 so 
æ gives a “fractional” degree of smoothness for f. Moreover, C%° = C% and C*! is 
quite close to C*+!, since Lipschitz functions are almost surely differentiable. 


Theorem 1.6.7 (Regularity of Transport Maps) Fix open sets Q1, Q C Rf, with 
Q, convex, and absolutely continuous measures u,v € P(IR¢) with finite sec- 
ond moments and bounded, strictly positive densities f,g, respectively, such that 
uU(Q1) = 1 = v(Q2). Let @ be such that Vo#u = v. 


1. If Q, and Q are bounded and f ,g are bounded below, then @ is strictly convex 
and of class C!s*(Q)) for some a > 0. 
2. If Qi = Q, =R? and f,g € C°, then ọ € C>% (Q1). 


If in addition f,g € C&, then @ € C> (Q1). 


In other words, the optimal map T = Vd € C*+!:“(Q,) is one derivative smoother 
than the densities, so has the same smoothness as the measures u,v. 

Theorem 1.6.7 will be used in two ways in this book. Firstly, it is used to derive 
criteria for a Karcher mean of a collection of measures to be the Fréchet mean of that 
collection (Theorem 3.1.15). Secondly, it allows one to obtain very smooth estimates 
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for the transport maps. Indeed, any two measures u and v can be approximated by 
measures satisfying the second condition: one can approximate them by discrete 
measures using the law of large numbers and then employ a convolution with, e.g., 
a Gaussian measure (see, for instance, Theorem 2.2.7). It is not obvious that the 
transport maps between the approximations converge to the transport maps between 
the original measures, but we will see this to be true in the next section. 


1.7 Stability of Solutions Under Weak Convergence 


In this section, we discuss the behaviour of the solution to the Monge—Kantorovich 
problem when the measures u and v are replaced by approximations Un and Vp. 
Since any measure can be approximated by discrete measures or by smooth mea- 
sures, this allows us to benefit from both worlds. On the one hand, approximating u 
and v with discrete measures leads to the finite discrete problem of Sect. 1.3 that can 
be solved exactly. On the other hand, approximating u and v with Gaussian convo- 
lutions thereof leads to very smooth measures (at least on IR“), and so the regularity 
results of the previous section imply that the respective optimal maps will also be 
smooth. Finally, in applications, one would almost always observe the measures of 
interest u and v with a certain amount of noise, and it is therefore of interest to 
control the error introduced by the noise. In image analysis, can represent an im- 
age that has undergone blurring, or some other perturbation (Amit et al. [13]). In 
other applications, the noise could be due to sampling variation, where instead of u 
one observes a discrete measure Uy obtained from realisations X),..., Xy of random 
elements with distribution u as Uy = N~! $^] 6{X;} (see Chap. 4). 

In Sect. 1.7.1, we will see that the optimal transference plan 7 depends continu- 
ously on u and v. With this result under one’s belt, one can then deduce an analo- 
gous property for the optimal map T from u to v given some regularity of u, as will 
be seen in Sect. 1.7.2. 

We shall assume throughout this section that u, —> u and v, — v weakly, which, 
we recall, means that fy f dln —> fa fdu for all continuous bounded f : X —> R. 
The following equivalent definitions for weak convergence will be used not only in 
this section, but elsewhere as well. 


Lemma 1.7.1 (Portmanteau) Let 2 be a complete separable metric space and let 
U, Un E P(X). Then the following are equivalent: 


Un > U weakly; 

F,, (x) > F (x) for any continuity point x of F. Here X = R4, F, is the distribution 
function of Un and F is that of u; 

for any open GC X, liminf u,(G) > u (G); 

for any closed F C X, limsupu,(F) < u(F); 

fhdun —> fhdu for any bounded measurable h whose set of discontinuity points 
is a u-null set. 
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For a proof, see, for instance, Billingsley [24, Theorem 2.1]. The equivalence with 
the last condition can be found in Pollard [104, Section II.2]. 


1.7.1 Stability of Transference Plans and Cyclical Monotonicity 


In this subsection, we state and sketch the proof of the fact that if Un — u and 
Vn — V weakly, then the optimal transference plans 7, € H (Un, Vn) converge to an 
optimal 7 € IT(u,v). The result, as stated in Villani [125, Theorem 5.20], is valid on 
complete separate metric spaces with general cost functions, and reads as follows. 


Theorem 1.7.2 (Weak Convergence and Optimal Plans) Let U„ and V, converge 
weakly to u and v, respectively, in P(X) and let c : X? — R, be continuous. If 
Tn € T (Un, Vn) are optimal transference plans and 


lim sup cy) dmn (x,y) < 


n=% 
then (Tn) is a tight sequence and each of its weak limits n € TI (u,v) is optimal. 


One can even let c vary with n under some conditions. 

Let c(x,y) = ||x—y||?/2. We prefer to keep the notation c(-,-) in order to stress 
the generality of the arguments. A key idea in the proof is the replacement of opti- 
mality by another property called cyclical monotonicity, which behaves nicely with 
respect to weak convergence. To motivate this property, we recall the discrete case 
of Sect. 1.3 where u = NT! 5}; 5{x;} and v = N7! EY; 5{y;}. There exists an op- 
timal transference plan 7 induced from a permutation 09 € Sy. Since the ordering 
of {x;} and {y;} is irrelevant in the representations of u and v, we may assume 
without loss of generality that Oo is the identity permutation. Then, by definition of 
optimality, 


N N 
2 enyi) < de c(i Yol)  F ESN. (1.7) 
If o is the identity except for a subset i),...,i,, n < N, then in particular 


n n 
X Gia) < DY clip Yoriz))> O €E Sn, 
k=1 k=1 


and if we choose o (ip) = i¢—1 with io = in, this writes 


n 


¥ cGy) S X c(xio Yia) (1.8) 


k=1 


Sa 
II 
= 


By decomposing a permutation o € Sy to disjoint cycles, one can verify that (1.8) 
implies (1.7). This will be useful since, as it turns out, a variant of (1.8) holds for 
arbitrary measures u and v for which there is no relevant finite N as in (1.7). 
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Definition 1.7.3 (Cyclically Monotone Sets and Measures) A set T C X 2 is 
cyclically monotone if for any n and any (x,,y1),---;(Xn,Yn) ET, 


n 


Yeas È c(xiyi-1), (yo = Yn). (1.9) 


i=1 i=1 


A probability measure n on X? is cyclically monotone if there exists a monotone 
Borel set T such that n(T) = 1. 


The relevance of cyclical monotonicity becomes clear from the following obser- 
vation. If u and v are discrete uniform measures on N points and o is an op- 
timal permutation for the Monge—Kantorovich problem, then the coupling 7 = 
(iN) ye 6{(xi,¥o(i))} is cyclically monotone. In fact, even if the optimal per- 
mutation is not unique, the set 


T= {(%i, yo) i= 1,...,N, 0 € Sy optimal} 


is cyclically monotone. Furthermore, m € IT(u,v) is optimal if and only if it is 
cyclically monotone, if and only if z(I”) = 1. It is heuristically easy to see that 
cyclical monotonicity is a necessary condition for optimality: 


Proposition 1.7.4 (Optimal Plans Are Cyclically Monotone) Let u,v € P(X) 
and suppose that the cost function c is nonnegative and continuous. Assume that 
the optimal n € TI (u,v) has a finite total cost. Then suppr is cyclically monotone. 
In particular, 7 is cyclically monotone. 


The idea of the proof is that if for some (x1,y1),---, (Xn, Yn) in the support of 7, 


n n 
Yc.) > > c(xi,¥i-1); 
i=l i=l 


then by continuity of c, the same inequality holds on some balls of positive mea- 
sure. One can then replace 2 by a measure having (x;,y;—1) rather than (x;, y;) in its 
support, and this measure will incur a lower cost than 7. A rigorous proof can be 
found in Gangbo and McCann [59, Theorem 2.3]. 

Thus, optimal transference plans m% solve infinitely many discrete Monge- 
Kantorovich problems emanating from their support. More precisely, for any fi- 
nite collection (x;,y;) € suppr, i = 1,...,N and any permutation o € Sy, (1.7) 
is satisfied. Therefore, the identity permutation is optimal between the measures 
(1/N)¥.8{xj} and (1/N) Z ô{y;}. 

In the same spirit as I” defined above for the discrete case, one can strengthen 
Proposition 1.7.4 and prove existence of a cyclically monotone set I" that includes 
the support of any optimal transference plan 7: take l = Usupp(z) for 7 optimal. 

The converse of Proposition 1.7.4 also holds. 


Proposition 1.7.5 (Cyclically Monotone Plans Are Optimal) Let u,v € P(X), 
c: X? — R, continuous and T € TI (u,v) a cyclically monotone measure with 
C(z) finite. Then T is optimal in TI (u,v). 
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Let us sketch the proof in the quadratic case c(x,y) = ||x—y||?/2 and see how con- 
vexity comes into play. Straightforward algebra shows that (1.9) is equivalent, in the 
quadratic case, to 


n 


Y Orxa) L0, (Xa = 1). (1.10) 
i=l 


Fix (xo, yo) € T = suppz and define @ : 2 — RU {co} by 


o (x) = sup { (y0, x1 — xo) peep (Ym—1;Xm —Xm—1) 
+(¥m,X— Xm): MEN, (xiy) ET}. 


This function is defined as a supremum of affine functions, and is therefore convex 
and lower semicontinuous. Cyclical monotonicity of implies that ¢ (xo) = 0, so @ 
is not identically infinite (it would have been so if I were not cyclically monotone). 
Straightforward computations show that I" is included in the subdifferential of @: y 
is a subgradient of @ at x when (x,y) € I. Optimality of 2 then follows by weak 
duality, since 7 assigns full measure to the set of (x,y) such that ¢ (x) + ¢* (y) = 
(x,y); see (1.5) and the discussion around it. 

The argument for more general costs follows similar lines and is sketched at the 
end of this subsection. 

Given these intermediary results, it is now instructive to prove Theorem 1.7.2. 


Proof (Proof of Theorem 1.7.2). Since UL, — u weakly, it is a tight sequence, and 
similarly for v,,. Consequently, the entire set of plans U,,IT(Un,V,) is tight too (see 
the discussion before deriving (1.3)). Therefore, up to a subsequence, (7,) has a 
weak limit 2. We need to show that z is cyclically monotone and that C(z) is finite. 
The latter is easy, since cy (x,y) = min(M,c(x,y)) is continuous and bounded: 


C(x) = lim cu dx = lim lim cu dt < limint [ cdt, < œ. 
M> J 92 M->x0n-oo J g2 noo J Qe 

To show that z is cyclically monotone, fix (x1,y1),..-,(xv,yw) € suppr. We show 

that there exist (x/,y;) € supp, that converge to (xx, y;). Once this is established, 

we conclude from the cyclical monotonicity of supp, and the continuity of c that 


Maz 
mz 


N N 
C(xe, yk) = lim X ck yg) < lim Y cak yk) = ¥ cr y1). 
n-eoo k=1 n— 2 k=1 


> 
ll 


1 k=1 


The existence proof for the sequence is standard. For € > 0, let B = Be (xx, yx) be 
an open ball around (xk, yk). Then 2(B) > 0 and by the portmanteau Lemma 1.7.1, 
Tn (B) > 0 for sufficiently large n. It follows that there exist (x,y) € BO suppmn. 
Let £ = 1/m, say, then for all n > Nm we can find (x{,y¢) € supp of distance 
2/m from (xx, yk). We can choose Nm+1 > Nm without loss of generality in order to 
complete the proof. 
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A few remarks are in order. Firstly, quadratic cyclically monotone sets (with respect 
to ||x—y||?/2) are included in the subdifferential of convex functions. The converse 
is also true, as can be easily deduced from summing up the subgradient inequalities 


@(xi41) = Q (xi) + Oi, Xi41 — Xi), i=1,...,N, 


where y; is a subgradient of @ at x;. For future reference, we state this characterisa- 
tion as a theorem (which is valid in infinite dimensions too). 


Theorem 1.7.6 (Rockafellar [112]) A nonempty T C X? is quadratic cyclically 
monotone if and only if it is included in the graph of the subdifferential of a lower 
semicontinuous convex function that is not identically infinite. 


Secondly, we have not used at all the Kantorovich duality, merely its weak form. 
The machinery of cyclical monotonicity can be used in order to prove the duality 
Theorem 1.4.2. This is indeed the strategy of Villani [125, Chapter 5], who explains 
its advantage with respect to Hahn—Banach-type duality proofs. 

Lastly, the idea of the proof of Proposition 1.7.5 generalises to other costs in a 
natural way. Given a cyclically monotone (with respect to a cost function c) set I" 
and a fixed pair (xo, yo) € I, define (Riischendorf [116]) 


o(x)=inf{c(x1 ,yo)—c(xo, yo) a C(Xm, Ym-1)—C(Xm-1 „Ym-1) + C(X,¥m)—C(Xm Ym) } : 


Then under some conditions, (@, y) is dual optimal for some y. As explained in 
Sect. 1.8, y can be chosen to be essentially o° (as defined in that section). 


1.7.2 Stability of Transport Maps 


We now extend the weak convergence of m, to 7 of the previous subsection to con- 
vergence of optimal maps. Because of the applications we have in mind, we shall 
work exclusively in the Euclidean space 2 = R? with the quadratic cost function; 
our results can most likely be extended to more general situations. 

In this setting, we know that optimal plans are supported on graphs of subdif- 
ferentials of convex functions. Suppose that m, is induced by T, and 7 is induced 
by T. Then in some sense, the weak convergence of 7, to m yields convergence of 
the graphs of T, to the graph of T. Our goal is to strengthen this to uniform conver- 
gence of T, to T. Roughly speaking, we show the following: there exists a set A with 
L(A) = 1 and such that T, converge uniformly to T on every compact subset of A. 
For the reader’s convenience, we give a user-friendly version here; a more general 
statement is given in Proposition 1.7.11 below. 


Theorem 1.7.7 (Uniform Convergence of Optimal Maps) Let Un, u be absolutely 
continuous measures with finite second moments on an open convex set U C R? such 
that Un —> u weakly, and let v, > v weakly with Vn, Vv € P(IR¢) with finite second 
moments. If T, and T are continuous on U and C(T,) is bounded uniformly in n, 
then 
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sup || (x) — T (x) || — 0, n— ©, 
xEQ 


for any compact Q C U. 


Since T, and T are only defined up to Lebesgue null sets, it will be more convenient 
to work directly with the subgradients. That is, we view 7, and T as set-valued 
functions that to each x € R? assign a (possibly empty) subset of R°. In other words, 
T, and T take values in the power set of R¢, denoted by oR’ 

Let @ : R? — RU {æ} be convex, yı € 9¢ (x1) and yz € J¢ (x2). Putting n = 2 in 
the definition of cyclical monotonicity (1.10) gives 


(y2 — y1; X2 — x1) > 0. 


This property (which is weaker than cyclical monotonicity) is important enough to 
have its own name. Following the notation of Alberti and Ambrosio [6], we call 
a set-valued function (or multifunction) u : R? > QR" monotone if whenever y; € 
u(x;), i= 1,2, 

(y2 — Yi, — x1) > 0. 


If d = 1, this simply means that u is a nondecreasing (set-valued) function. For 
example, one can define u(x) = {0} for x € [0,1), u(1) = [0,1] and u(x) = 0 if 
x ¢ [0,1]. Next, u is said to be maximally monotone if no points can be added to its 
graph while preserving monotonicity: 


{(y —y,x’—x) >0 wheneveryeu(x)} => y €u(x'). 


It will be convenient to identify u with its graph; we will often write (x,y) € u to 
mean y € u(x). Note that u(x) can be empty, even when u is maximally monotone. 
The previous example for u is not maximally monotone, but it will be if we modify 
u(0) to be (—ce, 0] and u(1) to be [0,ce). 

Of course, if @ : R? — RU {æ} is convex, then u = 0@ is monotone. It fol- 
lows from Theorem 1.7.6 that u is maximally cyclically monotone (no points can be 
added to its graph while preserving cyclical monotonicity). It can actually be shown 
that u is maximally monotone [6, Section 7]. In what follows, we will always work 
with subdifferentials of convex functions, so unless stated otherwise, u will always 
be assumed maximally monotone. 

Maximally monotone functions enjoy the following very useful continuity prop- 
erty. It is proven in [6, Corollary 1.3] and will be used extensively below. 


Proposition 1.7.8 (Continuity at Singletons) Let x € R? such that u(x) = {y} is a 
singleton. Then u is nonempty on some neighbourhood of x and it is continuous at 
x: if Xn > x and yn € u(xXp), then yn > y. 


Notice that this result implies that if a convex function @ is differentiable on some 
open set E C Rf, then it is continuously differentiable there (Rockafellar [113, 
Corollary 25.5.1]). 
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If f : R > RU {œ} is any function, one can define its subgradient at x locally as 


Of (x) = ty: F) 2 F(a) + z= x) + o(|lz—-xl1)} 


ee l- Z S 


(See the discussion after Theorem 1.8.2.) When f is convex, one can remove the 
o(||z—x||) term and the inequality holds for all z, i.e., globally and not locally. Since 
monotonicity is related to convexity, it should not be surprising that monotonicity is 
in some sense a local property. Suppose that u(xo) = {yo} is a singleton and that for 
some y* € R, 


(y—y*,x— xo) > 0 


for all x € Rf and y € u(x). Then by maximality, y* must equal yọ. By “local prop- 
erty”, we mean that the conclusion y* = yo holds if the above inequality holds for x 
in a small neighbourhood of xo (an open set that includes xo). We will need a more 
general version of this result, replacing neighbourhoods by a weaker condition that 
can be related to Lebesgue points. The strengthening is somewhat technical; the 
reader can skip directly to Lemma 1.7.10 and assume that G is open without losing 
much intuition. 

Let B,(xo) = {x : ||x—xo|| < r} for r > 0 and xo € R°. The interior of a set G C R? 
is denoted by intG and the closure by G. If G is measurable, then LebG denotes the 
Lebesgue measure of G. Finally, convG denotes the convex hull of G. 

A point x9 is a Lebesgue point (or of Lebesgue density) of a measurable set 
G C R! if for any £ > 0 there exists tg > 0 such that 


Leb(B; (x0) NG) 
Leb(B;(xo)) 


>l-e, O0<t<te. 


An illuminating example is the set {y < ,/|x|} in R? (see Fig. 1.1). Since the “slope” 
of the square root is infinite, x9 = (0,0) is a Lebesgue point, but the fraction above 
is strictly smaller than one, for all t > 0. 


Fig. 1.1: The set G = {(x,y) : |x] < 1, —0.2 < y < y/ |x|} 


We denote the set of points of Lebesgue density of G by G**". Here are some 
facts about G”: clearly, intG C G*" C G. Stein and Shakarchi [121, Chapter 3, 
Corollary 1.5] show that Leb(G\ G”) = 0 (and Leb(G*" \ G) = 0, so G*" is very 
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close to G). By the Hahn—Banach theorem, G*" C int(conv(G)): indeed, if x is not 
in int(convG), then there is a separating hyperplane between x and convG > G, so 
the fraction above is at most 1/2 for all t > 0. 

The “denseness” of Lebesgue points is materialised in the following result. It is 
given as exercise in [121] when d = 1, and the proof can be found on page 27 in the 
supplement. 


Lemma 1.7.9 (Density Points and Distance) Let xo be a point of Lebesgue density 
of a measurable set G C R¢. Then 


5(z) = (2) = inf |z—x]| = o(lz—= xoll), as z> x0 


Of course, this result holds for any xo € G if the little o is replaced by big O, since 
6 is Lipschitz. When xo € intG, this is trivial because 6 vanishes on intG. 

The important part here is the following corollary: for almost all x € G, 6(z) = 
o(||z — ||) as z > x. This can be seen in other ways: since 6 is Lipschitz, it is 
differentiable almost everywhere. If x € G and 6 is differentiable at x, then V6(x) 
must be 0 (because 6 is minimised there), and then 6(z) = o(||z — x||). We just 
showed that 6 is differentiable with vanishing derivative at all Lebesgue points of x. 
The converse is not true: G = {+1/n}*_, has no Lebesgue points, but 5(y) < 4y? 
as y > 0. 

The locality of monotone functions can now be stated as follows. It is proven on 
page 27 of the supplement. 


Lemma 1.7.10 (Local Monotonicity) Let xo € R such that u(xo) = {yo} and xo 
is a Lebesgue point of a set G satisfying 


(y-y*,x-x0)>0 VxE GVyeu(x). 


Then y* = yo. In particular, the result is true if the inequality holds on G = O\ WY 
with Ø 4 O open and NV Lebesgue negligible. 


These continuity properties cannot be of much use unless u(x) is a singleton for 
reasonably many values of x. Fortunately, this is indeed the case: the set of points x 
such that u(x) contains more than one element has Lebesgue measure 0 (see Alberti 
and Ambrosio [6, Remark 2.3] for a stronger result). Another issue is that u may be 
empty, and convexity comes into play here again. Let domu = {x : u(x) Æ Ø}. Then 
there exists a convex closed set K such that 


intK C domu C K. 


[6, Corollary 1.3(2)]. Although domu itself may fail to be convex, it is almost con- 
vex in the above sense. By convexity, K \ intK has Lebesgue measure 0 (see the 
discussion after Theorem 1.6.1) and so the set of points in K where u is not a sin- 
gleton, 


{x € K : u(x) =0}U{x € K : u(x) contains more than one point}, 


28 1 Optimal Transport 


has Lebesgue measure 0, and u(x) is empty for all x ¢ K. (It is in fact not difficult to 
show that if x € OK, then u(x) cannot be a singleton, by the Hahn—Banach theorem.) 

With this background on monotone functions at our disposal, we are now ready 
to state the stability result for the optimal maps. We assume the following. 


Assumptions 1 Let Un, U, Vn,Vv € P(RÎ) with optimal couplings (with respect to 
quadratic cost) T, € (Un, Vn), m E€ H(u,v) and convex potentials 6, and @, re- 
spectively, such that 


e (convergence) Un — U and V, — V weakly; 
e (finiteness) the optimal couplings Tn € TI(Un, Vn) satisfy 


1 2 
limsup |, zllx— yl? dm (x,y) < e; 


e (unique limit) the optimal n € T (u,v) is unique. 
We further denote the subgradients 06, and 0 by un and u, respectively. 


These assumptions imply that 7 has a finite total cost. This can be shown by the 
liminf argument in the proof of Theorem 1.7.2 but also from the uniqueness of m. 
As a corollary of the uniqueness of 7, it follows that z,, —> 7 weakly; notice that this 
holds even if 7, is not unique for any n. We will now translate this weak convergence 
to convergence of the maximal monotone maps un to u, in the following form. 


Proposition 1.7.11 (Uniform Convergence of Optimal Maps) Let Assumptions 1 
hold true and denote E = suppu and E°°" the set of its Lebesgue points. Let Q be 
a compact subset of E” on which u is univalued (i.e., u(x) is a singleton for all 
x€ Q). Then un converges to u uniformly on Q: u(x) is nonempty for all x € Q 
and all n > Ng, and 


sup sup ||y—u(x)|| 3 0, n—> co, 
XEQ yEun (x) 


In particular, ifu is univalued throughout int(E) (so that @ € C! there), then uniform 
convergence holds for any compact Q C int(E). 


The proof of Proposition 1.7.11, given on page 28 of the supplement, follows two 
separate steps: 


e if asequence in the graph of u, converges, then the limit is in the graph of u; 
e sequences in the graph of un are bounded if the domain is bounded. 


Corollary 1.7.12 (Pointwise Convergence u-Almost Surely) Jf in addition u is 
absolutely continuous, then u(x) — u(x) -almost surely. 


Proof. We first claim that E C domu. Indeed, for any x € E and any € > 0, the ball 
B = B; (x) has positive measure. Consequently, u cannot be empty on the entire ball, 
because otherwise u (B) = 2(B x R?) would be 0. Since domu is almost convex (see 
the discussion before Assumptions 1), this implies that actually int(convE) C domu. 
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The rest is now easy: the set of points x € E for which Q = {x} fails to satisfy 
the conditions of Proposition 1.7.11 is included in 


(E \ E*") U {x € int(conv(E)) : u(x) contains more than one point}, 


which is u-negligible because u is absolutely continuous and both sets have 
Lebesgue measure 0. 


1.8 Complementary Slackness and More General Cost Functions 


It is well-known (Luenberger and Ye [89, Section 4.4]) that the solutions to the 
primal and dual problems are related to each other via complementary slackness. In 
other words, solution of one problem provides a lot of information about the solution 
of the other problem. Here, we show that this idea remains true for the Kantorovich 
primal and dual problems, extending the discussion in Sect. 1.6.1 to more general 
cost functions. 

Let 2 and Y be complete separable metric spaces, u E P(X), v € P(Y), and 
c: X xY — R, bea measurable cost function. 

If one finds functions (9, y) € ®, and a transference plan m € H(u,v) having 
the same objective values, then by weak duality (@, w) is optimal in ®, and 7 is 
optimal in T (u,v). Having the same objective values is equivalent to 


J, eœ) - 98) - vo)]da(x,y) =0 
BXy 
which is in turn equivalent to 

p(x) +y (y) = c(x,y), m-almost surely. 


It has already been established that there exists an optimal transference plan 7*. 
Assuming that C(2*) < œ% (otherwise all transference plans are optimal), a pair 
(p, y) € ®, is optimal if and only if 


(x) + wy) = c(x,y), m*-almost surely. 


Conversely, if (@o, Yo) is an optimal pair, then 7% is optimal if and only if it is con- 
centrated on the set 


{(x,y) : Po(x) + Yo(y) = ex, y)}- 


In particular, if for a given x there exists a unique y such that @p(x) + Wo(y) =c(x,y), 
then the mass at x must be sent entirely to y and not be split; if this is the case for u- 
almost all x, then this relation defines y as a function of x and the resulting optimal 7 
is in fact induced from a transport map. This idea provides a criterion for solvability 
of the Monge problem (Villani [125, Theorem 5.30]). 
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1.8.1 Unconstrained Dual Kantorovich Problem 


It turns out that the dual Kantorovich problem can be recast as an unconstrained 
optimisation problem of only one function @. The new formulation is not only con- 
ceptually simpler than the original one, but also sheds light on the properties of the 
optimal dual variables. Since the dual objective function to be maximised, 


p odu + | ydv, 


is increasing in @ and y, one should seek functions that take values as large as 
possible subject to the constraint (x) + w(y) < c(x,y). Suppose that an oracle tells 
us that some @ € Lı (u) is a good candidate. Then the largest possible y satisfying 
(Q, y) € ®, is defined as 


vO) = inf [e(x,y) — o (x)] = o0). 


A function taking this form is called c-concave [124, Chapter 2]; we say that y is the 
c-transform of @. It is not necessarily true that ° is integrable or even measurable, 
but if we neglect this difficulty, then it is obvious that 


sup if, oa f, vav| = f oiu fo dv. 
wel (v y)E®: 


The dual problem can thus be formulated as the unconstrained problem 


sup Le pau | o av]. 
peli (u 


One can apply this c-transform again and replace @ by 


Ore Py a ey) PO) = Oe), 
so that p has a better objective value yet still (°°, 0°) € $, (modulo measurability 
issues). An elementary calculation shows that in general 0° = ¢—°. Thus, for any 
function @y, the pair of functions (@, y) = (f°, pf) has a better objective value than 
(1, y1), and satisfies (p, y) E€ Ð.. Moreover, p° = y and y~ = ọ; in words, p and 
y are c-conjugate. An optimal dual pair (@, y) can be expected to be c-conjugate; 
this is indeed true almost surely: 


Proposition 1.8.1 (Existence of an Optimal Pair) Let u and v be probability mea- 
sures on & and Y such that the independent coupling with respect to the nonnega- 
tive and lower semicontinuous cost function is finite: J 9-,.4/¢(x,y) du(x)dv(y) < 
Then there exists an optimal pair (Q, y) for the dual Kantorovich problem. Further- 
more, the pair can be chosen in a way that U-almost surely, @ = wW° and v-almost 
surely, y = Q°. 


Proposition 1.8.1 is established (under weaker conditions) by Ambrosio and Pratelli 
[11, Theorem 3.2]. It is clear from the discussion above that once existence of an 
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optimal pair (@, y1) is established, the pair (p, y) = (f°, pf) should be optimal. 
Combining Proposition 1.8.1 with the preceding subsection, we see that if @ is op- 
timal (for the unconstrained dual problem), then any optimal transference plan 7* 
must be concentrated on the set 


{(x,y) : p) + o0) =e y)}- 


If for u-almost every x this equation defines y uniquely as a (measurable) function 
of x, then 2* is induced by a transport map. Indeed, we have seen how this is the 
case, in the quadratic case c(x,y) = ||x — y||?/2, when u is absolutely continuous. 
An extension to p > 1 (instead of p = 2) is sketched in Sect. 1.8.3. 

We remark that at the level of generality of Proposition 1.8.1, the function @* 
may fail to be Borel measurable; Ambrosio and Pratelli show that this pair can be 
modified up to null sets in order to be Borel measurable. If c is continuous, however, 
then ° is an infimum of a collection of continuous functions (in y). Hence —@*° is 
lower semicontinuous, which yields that @° is measurable. When c is uniformly 
continuous, measurability of @° is established in a more lucid way, as exemplified 
in the next subsection. 


1.8.2 The Kantorovich—Rubinstein Theorem 


Whether @°(y) is tractable to evaluate depends on the structure of c. We have seen 
an example where c was the quadratic Euclidean distance. Here, we shall consider 
another useful case, where c is a metric. Assume that X = Y, denote their metric 
by d, and let c(x,y) = d(x,y). If @ = y% is c-concave, then it is 1-Lipschitz. Indeed, 
by definition and the triangle inequality 


Pl) = ink d(c.») — w0)] < inf d(x») +4,2) — wO) = Pla) +4(x,2). 
Interchanging x and z yields |@(x) — @(z)| < d(x,z)? 

Next, we claim that if @ is Lipschitz, then @°(y) = —@(y). Indeed, choosing 
x = y in the infimum shows that 0°(y) < d(y,y) — @(y) = —o(y). But the Lipschitz 
condition on @ implies that for all x, d(x,y) — g(x) > —@(y). In view of that, we 
can take in the dual problem ọ to be Lipschitz and y = —@, and the duality formula 
(Theorem 1.4.2) takes the form 


inf d(x,y)da(x,y) = su | d -f dv], 
ae (x,y) da(x,y) is aan 
le(x) -— e(y) | 
|@||Lip = sup —————. (1.11) 
=e x#y d(x,y) 


3 In general, y° inherits the modulus of continuity of c, see Santambrogio [119, page 11]. 
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This is known as the Kantorovich—Rubinstein theorem [124, Theorem 1.14]. (We 
have been a bit sloppy because @ may not be integrable. But if for some xo € Z, 
x> d(x,x9) is in Lı (u), then any Lipschitz function is U-integrable. Otherwise one 
needs to restrict the supremum to, e.g., bounded Lipschitz @.) 


1.8.3 Strictly Convex Cost Functions on Euclidean Spaces 


We now return to the Euclidean case .2 = % = R? and explore the structure of 
c-transforms. When c is different than ||x— y||?/2, we can no longer “open up the 
square” and relate the Monge—Kantorovich problem to convexity. However, we can 
still apply the idea that p(x) + @°(y) = c(x,y) if and only if the infimum is attained 
at x. Indeed, recall that 


p(y) = int [e(x,y) — ọ(x)], 


so that p(x) + ọ°(y) = c(x,y) if and only if 


o(z)- p(x) <c(z,y)-—c(x,y), ZEX. 


Notice the similarity to the subgradient inequality for convex functions, with the 
sign being reversed. In analogy, we call the collection of y’s satisfying the above in 
equality the c-superdifferential of @ at x, and we denote it by 0°@(x). Of course, if 
c(x,y) = ||x— yl|?/2, then y € 3° (x) if and only if y is a subgradient of (|| - ||7/2—@) 
at x. 

The following result generalises Theorem 1.6.2 to other powers p > 1 of the 
Euclidean norm. These cost functions define the Wasserstein distances of the next 
chapter. 


Theorem 1.8.2 (Strictly Convex Costs on R?) Let c(x,y) = h(x— y) with h(v) = 
|| ||? /p for some p > 1 and let u and v be probability measures on R¢ with finite p- 
th moments such that u is absolutely continuous with respect to Lebesgue measure. 
Then the solution to the Kantorovich problem with cost function c is unique and 
induced from a transport map T. Furthermore, there exists an optimal pair (Q, 0°) 
of the dual problem, with @ c-concave. The solutions are related by 


T(x) =x—V(x)||Ve(x)|| vee (u-almost surely). 


Proof (Assuming v has Compact Support). The existence of the optimal pair (@, °) 
with the desired properties follows from Proposition 1.8.1 (they are Borel mea- 
surable because c is continuous). We shall now show that @ has a unique c- 
supergradient u-almost surely. 

Step 1: ọ is c-superdifferentiable. Let z* be an optimal coupling. By duality 
arguments, 7 is concentrated on the set of (x,y) such that y € 0°@(x). Consequently, 
for u-almost any x, the c-superdifferential of @ at x is nonempty. 
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Step 2: ¢ is differentiable. Here, we impose the additional condition that v is 
compactly supported. Then @ can be taken as a c-transform on the compact support 
of v. Since h is locally Lipschitz (it is C! because p > 1) this implies that @ is locally 
Lipschitz. Hence, it is differentiable Lebesgue almost surely, and consequently u- 
almost surely. 

Step 3: Conclusion. For u-almost every x there exists y € 0°@(x) and a gradient 
u = V(x). In particular, u is a subgradient of @: 


(z) — P(x) > (u,z—x) + 0(||z—>l]). 


Here and more generally, o(||z— x||) denotes a function r(z) (defined in a neigh- 
bourhood of x) such that r(z)/||z—x|| — 0 as z — x. (If @ were convex, then we 
could take r = 0, so the definition for convex functions is equivalent, and then the 
inequality holds globally and not only locally.) But y € 0°@(x) means that as z — x, 


h(z—y) —h(x-y) =c(z,y) — (x,y) = (z) — p(x) > (u,z— x) +0(l|z — xl|). 


In other words, u is a subgradient of h at x— y. But h is differentiable with gradient 
Vh(v) = v||v||?-? (zero if v = 0). We obtain V(x) = u = Vh(x — y) and since the 
gradient of h is invertible, we conclude 


y=T(x):=x—-(Vh)'[Voe(a)], 


which defines y as a (measurable) function of x.* Hence, the optimal transference 
plan z is unique and induced from the transport map T. 


The general result, without assuming compact support for v, can be found in 
Gangbo and McCann [59]. It holds for a larger class of functions h, those that are 
strictly convex on R¢ (this yields that Vh is invertible), have superlinear growth 
(h(v)/||v|| — ce as v > œ) and satisfying a technical geometric condition (which 
||v||?/p does when p > 1). Furthermore, if A is sufficiently smooth, namely h € Cl! 
locally (it is if p > 2, but not if p € (1,2)), then u does not need to be absolutely 
continuous; it suffices that it not give positive measure to any set of Hausdorff di- 
mension smaller or equal than d — 1. When d = 1 this means that Theorem 1.8.2 is 
still valid as long as u has no atoms (u({x}) = 0 for all x € R), which is a weaker 
condition than u being absolutely continuous. 

It is also noteworthy that for strictly concave cost functions (e.g., p € (0, 1)), the 
situation is similar when the supports of and v are disjoint. The reason is that h 
may fail to be differentiable at 0, but it only needs to be differentiated at x — y with 
x € suppu and y € suppv. If the supports are not disjoint, then one needs to leave all 
common mass in place until the supports become disjoint (Villani [124, Chapter 2]) 
and then the result of [59] applies. As a simple example, let u be uniform on [0, 1] 
and v be uniform on [0,2]. After leaving common mass in place, we are left with 
uniforms on (0, 1] and [1,2] (with total mass 1/2) with essentially disjoint supports, 


4 Gradients of Borel functions are measurable, as the limit can be taken on a countable set. The 
inverse (Vh)—! equals the gradient of the Legendre transform h* and is therefore Borel measurable. 
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for which the optimal transport map is the decreasing map T (x) = 2 — x. Thus, the 
unique optimal 7 is not induced from a map, but rather from an equal weight mixture 
of T and the identity. Informally, each point x in the support of u needs to be split; 
half stays and x and the other half transported to 2 — x. The optimal coupling from v 
to u is unique and induced from the map S(x) = x if x < 1 and 2— x if x > 1, which 
is neither increasing nor decreasing. 
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Chapter 2 D 
The Wasserstein Space Check for 


updates 


The Kantorovich problem described in the previous chapter gives rise to a metric 
structure, the Wasserstein distance, in the space of probability measures P( 2’) on 
a space X. The resulting metric space, a subspace of P(X), is commonly known 
as the Wasserstein space W (although, as Villani [125, pages 118-119] puts it, this 
terminology is “very questionable”; see also Bobkov and Ledoux [25, page 4]). In 
Chap. 4, we shall see that this metric is in a sense canonical when dealing with 
warpings, that is, deformations of the space 2 (for example, in Theorem 4.2.4). 
In this chapter, we give the fundamental properties of the Wasserstein space. Af- 
ter some basic definitions, we describe the topological properties of that space in 
Sect. 2.2. It is then explained in Sect.2.3 how W can be endowed with a sort of 
infinite-dimensional Riemannian structure. Measurability issues are dealt with in 
the somewhat technical Sect. 2.4. 


2.1 Definition, Notation, and Basic Properties 


Let X be a separable Banach space. The p-Wasserstein space on X is defined as 


Yl) ={ wer): | waoe}, p>. 


We will sometimes abbreviate and write simply Y, instead of 9 (X). 

Recall that if u,v € P(X), then TI (u,v) is defined to be the set of measures 
r € P(X?) having u and v as marginals in the sense of (1.2). The p-Wasserstein 
distance between u and v is defined as the minimal total transportation cost between 
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and v in the Kantorovich problem with respect to the cost function c,(x,y) = 
llx—yll?: 


1/p 1/p 
W,(u,v)=( inf C —( inf — xp||P ar, 
v= (iE o) = (int. fy paal ae aa)]) 


The Wasserstein distance between u and v is finite when both measures are in 
P(X ), because 
[xa = x2]? < 2? la |? +2? x21. 


Thus, W, is finite on [7 (X)? = % (2) x W(X); it is nonnegative and sym- 
metric and it is easy to see that W,(u,v) = 0 if and only if u = v. A proof that 
W, is a metric (satisfies the triangle inequality) on Y, can be found in Villani [124, 
Chapter 7]. 

The aforementioned setting is by no means the most general one can consider. 
Firstly, one can define W, and Y, for 0 < p < 1 by removing the power 1/p from 
the infimum and the limit case p = 0 yields the total variation distance. Another 
limit case can be defined as W.,(u,Vv) = limp... Wp(u, v). Moreover, W, and Y, 
can be defined whenever 2 is a complete and separable metric space (or even 
only separable; see Clément and Desch [36]): one fixes some xo in 2 and replaces 
||x|| by d(x,xo). Although the topological properties below still hold at that level of 
generality (except when p = 0 or p =o»), for the sake of simplifying the notation we 
restrict the discussion to Banach spaces. It will always be assumed without explicit 
mention that 1 < p < œ. 

The space Y,,(.2°) is defined as the collection of measures u such that W, (u, ĉo) < 
œ with 6, being a Dirac measure at x. Of course, Wp(u, v) can be finite even if 
U,V É DPX). But if u E Y,(2) and v ¢ Y,(#), then W,(u, v) is always infi- 


nite. This can be seen from the triangle inequality 
œ = Wp(v, ôo) < Wp (u, 60) +Wp (u, v). 


In the sequel, we shall almost exclusively deal with measures in ⁄ (X). 
The Wasserstein spaces are ordered in the sense that if q > p, then (Z) C 
W,( X). This property extends to the distances in the form: 


q>p>1 => W,(u,v) > W,(u,Vv). (2.1) 


To see this, let m € H (u,v) be optimal with respect to q. Jensen’s inequality for the 
convex function z++ z4/? gives 


q/P 
wav) = f b-silaxtns)> ( f slants) =w) 
ae K? 


The converse of (2.1) fails to hold in general, since it is possible that W, be finite 
while W; is infinite. A converse can be established, however, if u and v are bounded: 
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1—p/q 
q2p21, u(K)=v(K)=1 = wu) Wp (av) (sap sv 
x,y! 


(2.2) 
Indeed, if we denote the supremum by dx and let m be now optimal with respect to 
p, then m(K x K) = 1 and 


wilu) < f le—sllan(ey) < de? f x-y? axl») = dW? Cv). 


Another useful property of the Wasserstein distance is the upper bound 


1/p 

Mestu < (J) -sola = -sle l CD 
for any pair of measurable functions t,s : 2° — X. Situations where this inequal- 
ity holds as equality and t and s are optimal maps are related to compatibility of 
the measures u, v = t#u and p = s#u (see Sect. 2.3.2) and will be of conceptual 
importance in the context of Fréchet means (see Sect. 3.1). 

We also recall the notation Br(xo) = {x : ||x —xo|| < R} and Br(xo) = {x : ||x— 
xol|| < R} for open and closed balls in 2. 


2.2 Topological Properties 


2.2.1 Convergence, Compact Subsets 


The topology of a space is determined by the collection of its closed sets. Since 
W(X) is a metric space, whether a set is closed or not depends on which se- 
quences in Y%,(.2°) converge. The following characterisation from Villani [124, 
Theorem 7.12] will be very useful. 


Theorem 2.2.1 (Convergence in Wasserstein Space) Let U, Un E€ Y%)(2). Then 
the following are equivalent: 


1. Wp(Un, U) > 0 as n > %; 
2. Ln => p weakly and fa |x]|? dpin(x) > flix? du (a); 
3. Un — U weakly and 


sup \|x||? dun (x) > 0, R > 0; (2.4) 
n J{x:||x||>R} 


4. for any C > 0 and any continuous f : X — R such that | f (x)| < C(1 +|lx||?) for 
all x, 


[f@) dun(x) > J1 du (x). 


40 2 The Wasserstein Space 


5. (Le Gouic and Loubes [87, Lemma 14]) U, — u weakly and there exists v € 
W(X) such that Wp (Un, v) > Wp( u,v). 


Consequently, the Wasserstein topology is finer than the weak topology induced 
on W(X) from P(X). Indeed, let 4 C H,(2) be weakly closed. If un € & 
converge to u in Ý (X), then Un + u weakly, so u € 2%. In other words, the 
Wasserstein topology has more closed sets than the induced weak topology. More- 
over, each Y,(.2°) is a weakly closed subset of P(X ) by the same arguments that 
lead to (1.3). In view of Theorem 2.2.1, a common strategy to establish Wasserstein 
convergence is to first show tightness and obtain weak convergence, hence a candi- 
date limit, and then show that the stronger Wasserstein convergence actually holds. 
In some situations, the last part is automatic: 


Corollary 2.2.2 Let KC X be a bounded set and suppose that U,(K) = 1 for all 
n> 1. Then Wp(Un, U) — 0 if and only if Un > u weakly. 


Proof. This is immediate from (2.4). 


The fact that convergence in Y, is stronger than weak convergence is exemplified 
in the following result. If 4n —> u and v, > v in Y,(.2°), then it is obvious that 
W,(Un, Vn) + Wp( u, Vv). But if the convergence is only weak, then the Wasserstein 
distance is still lower semicontinuous: 


lim inf Wp (Ln, Vn) > W,(u,v). (2.5) 


This follows from Theorem 1.7.2 and (1.3). 

Before giving some examples, it will be convenient to formulate Theorem 2.2.1 
in probabilistic terms. Let X,X, be random elements on 2 with laws U, Ln € 
W,(& ). Assume without loss of generality that X , X, are defined on the same prob- 
ability space (Q, F P) and write W,(Xn,X) to denote W, (Un, u). Then W,(Xn,X) > 
0 if and only if X, —> X weakly and E]|X,||? > E||X||?. 

An early example of the use of Wasserstein metric in statistics is due to Bickel 
and Freedman [21]. Let X,, be independent and identically distributed random vari- 
ables with mean zero and variance 1 and let Z be a standard normal random vari- 
able. Then Z, = X}; X;/,/n converge weakly to Z by the central limit theorem. 
But EZ? = 1 = EZ”, so W2(Z,,Z) — 0. Let Z* be a bootstrapped version of Z, 
constructed by resampling the X,,’s. If W2(Z;,Zn) — 0, then W2 (Z%,Z) — 0 and in 
particular Z% has the same asymptotic distribution as Zņ. 

Another consequence of Theorem 2.2.1 is that (in the presence of weak conver- 
gence) convergence of moments automatically yields convergence of smaller mo- 
ments (there are, however, more elementary ways to see this). In the previous exam- 
ple, for instance, one can also conclude that E|Z,,|? —> E|Z|? for any p < 2 by the 
last condition of the theorem. If in addition OX? < co, then 


3 
iZi =3-—+4+—1343=EzZ' 
n n 
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(see Durrett [49, Theorem 2.3.5]) so W4(Z,,Z) — 0 and all moments up to order 4 
converge. 

Condition (2.4) is called uniform integrability of the function x > ||x||? with re- 
spect to the collection (Un). Of course, it holds for a single measure u E (X) 
by the dominated convergence theorem. This condition allows us to characterise 
compact sets in the Wasserstein space. One should beware that when Z is infinite- 
dimensional, (2.4) alone is not sufficient in order to conclude that U, has a con- 
vergent subsequence: take Un to be Dirac measures at e, with (e,) an orthonormal 
basis of a Hilbert space 2 (or any sequence with ||e„|| = 1 that has no convergent 
subsequence, if 2 is a Banach space). The uniform integrability (2.4) must be ac- 
companied with tightness, which is a consequence of (2.4) only when 2 = Rê. 


Proposition 2.2.3 (Compact Sets in Y,) A weakly tight set X CW, is Wasserstein- 
tight (has a compact closure in W,) if and only if 


sup f Ixl du(x) 30, R>. (2.6) 
pert Halsl|>R} 


Moreover, (2.6) is equivalent to the existence of a monotonically divergent function 
g : R4 > R+ such that 


sup f xl!a( lll) du(x) <=. 
EXI K 


The proof is on page 41 of the supplement. 


Remark 2.2.4 For any sequence (Un) in Ý (tight or not) there exists a monotoni- 
cally divergent g with J 9-||x||? g(||x||) dun (x) < œ% for all n. 


Corollary 2.2.5 (Measures with Common Support) Let KC X be a compact 
set. Then 
KH =W (K) = {He P(2)-U(K) = 1} CW, (2%) 


is compact. 


Proof. This is immediate, since .% is weakly tight and the supremum in (2.6) van- 
ishes when R is larger than the finite quantity sup,<x ||x||. Finally, K is closed, so # 
is weakly closed, hence Wasserstein closed, by the portmanteau Lemma 1.7.1. 


For future reference, we give another consequence of uniform integrability, called 
uniform absolute continuity 


Ve 46 Vn VA C X Borel: Mn(A) <6 ==> [IKP dm) <E. (2.7) 
A 


To show that (2.4) implies (2.7), let € > 0, choose R = Rẹ > 0 such that the supre- 
mum in (2.4) is smaller than €/2, and set 6 = €/(2R?). If u„,(A) < 6, then 
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[amS fig lll? une) f y ll? dl) < BR? +e/2< e. 


2.2.2 Dense Subsets and Completeness 


If we identify a measure u € %⁄( X ) with a random variable X (having distribu- 
tion u), then X has a finite p-th moment in the sense that the real-valued random 
variable ||X|| is in Lp. In view of that, it should not come as a surprise that Y,(.2°) 
enjoys topological properties similar to Lp spaces. In this subsection, we give some 
examples of useful dense subsets of Y,(.2°) and then “show” that like 2 itself, it 
is a complete separable metric space. In the next subsection, we describe some of 
the negative properties that YW, (2°) has, again in similarity with Lp spaces. 

We first show that Y,(.2°) is separable. The core idea of the proof is the feasi- 
bility of approximating any measure with discrete measures as follows. 

Let u be a probability measure on 2’, and let X1,X2,... be a sequence of inde- 
pendent random elements in 2 with probability distribution u. Then the empirical 
measure Un is defined as the random measure (1/n)>"_, 6{X;}. The law of large 
numbers shows that for any (measurable) bounded or nonnegative f : 2 — R, al- 
most surely 


[ADO = 5% £00) + BVH) = [Peau 


In particular when f(x) = ||x||?, we obtain convergence of moments of order p. 
Hence by Theorem 2.2.1, if u E W(X), then Uun > u in Y,(2’) if and only if 
Un — U weakly. We know that integrals of bounded functions converge with prob- 
ability one, but the null set where convergence fails may depend on the chosen 
function and there are uncountably many such functions. When Z = Rf, by the 
portmanteau Lemma 1.7.1 we can replace the collection C,(.2°) by indicator func- 
tions of rectangles of the form (—%,a1] x -++ x (—e, ay] for a = (a1,...,ag) E RË. 
It turns out that the countable collection provided by rational vectors a suffices (see 
the proof of Theorem 4.4.1 where this is done in a more complicated setting). For 
more general spaces Z, we need to find another countable collection { f;} such that 
convergence of the integrals of f; for all j suffices for weak convergence. Such a col- 
lection exists, by using bounded Lipschitz functions (Dudley [47, Theorem 11.4.1]); 
an alternative construction can be found in Ambrosio et al. [12, Section 5.1]. Thus: 


Proposition 2.2.6 (Empirical Measures in %,) For any u € P(X ) and the corre- 
sponding sequence of empirical measures Un, Wp(Un, U) — 0 almost surely if and 


only ifu E Wy 2). 


Indeed, if u ¢ H, (2), then Wp(HUn, L) is infinite for all n, since U, is compactly 
supported, hence in Y%, (2°). 

Proposition 2.2.6 is the basis for constructing dense subsets of the Wasserstein 
space. 
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Theorem 2.2.7 (Dense Subsets of Y,) The following collections of measures are 
dense in W,( 2): 


1. finitely supported measures with rational weights; 

2. compactly supported measures; 

3. finitely supported measures with rational weights on a dense subset A C X; 

4. if X =R", the collection of absolutely continuous and compactly supported 
measures; 

5. if X =R¥%, the collection of absolutely continuous measures with strictly posi- 
tive and bounded analytic densities. 


In particular, W, is separable (the third set is countable as 2 is separable). 


This is a simple consequence of Proposition 2.2.6 and approximations, and the de- 
tails are given on page 43 in the supplement. 


Proposition 2.2.8 (Completeness) The Wasserstein space Wp( X ) is complete. 


One may find two different proofs in Villani [125, Theorem 6.18] and Ambrosio et 
al. [12, Proposition 7.1.5]. On page 43 of the supplement, we sketch an alternative 
argument based on completeness of the weak topology. 


2.2.3 Negative Topological Properties 


In the previous subsection, we have shown that %,(.2°) is separable and complete 
like Lp spaces. Just like them, however, the Wasserstein space is neither locally com- 
pact nor o-compact. For this reason, existence proofs of Fréchet means in %, (2°) 
require tools that are more specific to this space, and do not rely upon local com- 
pactness (see Sect. 3.1). 


Proposition 2.2.9 (Y, is Not Locally Compact) Let u E (X) and let € > 0. 
Then the Wasserstein ball 


Be(u) = {V E (Z) : Wp(u,v) < €} 


is not compact. 


Ambrosio et al. [12, Remark 7.1.9] show this when u is a Dirac measure, and we 
extend their argument on page 43 of the supplement. 
From this, we deduce: 


Corollary 2.2.10 The Wasserstein space W(X ) is not O-compact. 


Proof. If X is a compact set in Y,(.2°), then its interior is empty by Proposi- 
tion 2.2.9. A countable union of compact sets has an empty interior (hence cannot 
equal the entire space YW, (.2°)) by the Baire property, which holds on the complete 
metric space Y, (2°) by the Baire category theorem (Dudley [47, Theorem 2.5.2]). 
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2.2.4 Covering Numbers 


Let X C W,(2) be compact and assume that 2° = R°. Then for any € > 0 the 
covering number 


Nei) min | E W,(2) such that # C {u mise 


i=1 


is finite. These numbers appear in statistics in various ways, particularly in empir- 
ical processes (see, for instance, Wainwright [126, Chapter 5]) and the goal of this 
subsection is to give an upper bound for N(€;.%). Invoking Proposition 2.2.3, in- 
troduce a continuous monotone divergent f : [0,c°) — [0, c9] such that 


sup J IKIPE a(x) <1. 
EX JR 


The function f provides a certain measure of how compact .% is. If # = W,(K) is 
the set of measures supported on a compact K C Rf, then f(L) can be taken infinite 
for L large, and L can be treated as a constant in the theorem. Otherwise L increases 
as € N 0, at a speed that depends on f: the faster f diverges, the slower L grows 
with decreasing € and the better the bound becomes. 


Theorem 2.2.11 Let € > 0 and L = f~'(1/e?). Ifde < L, then 


d 
gne) <c) (Z) [p+ +a), 


with Cı (d) = 34e0y, C2(d, p) = (p +d)log3 + (p +2) log2 +log 04 and 04 = d[5 + 
logd + loglogd]. 


Since € > 0 is small and L is increasing in g, the restriction that de < L is typically 
not binding. We provide some examples before giving the proof. 

Example 1: if all the measures are supported on the d-dimensional unit ball, then 
L can be taken equal to one, independently of €. We obtain 


~ .  logN(€) 


N(e) := wer < (d+ p)C\(d)e~4 + smaller order terms. 


Example 2: if all the measures in .% have uniform exponential moments, then 
f(L) =e and N(e) is a constant times £~" log 1/e]“. The exponent p appears only 
in the constant. 

Example 3: suppose that .% is a Wasserstein ball of order p+ ô, that is, f(L) = 
LË. Then L ~ e~?/9 and 


N(e) < C\(d)(pt.a)(1+ p/d)e~4l+/81 
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up to smaller order terms. Here (when 0 < 6 < ce) the behaviour of Ñ (€) depends 
more strongly upon p: if p' < p, then we can replace 6 by 6’ = ô + p — p' > 6, 


leading to a smaller magnitude of N(£). 
Example 4: if f(L) is only logL, then N behaves like e~ 4+?) exp(e~?4), so p 
has a very dominant effect. 


Proof. The proof is divided into four steps. 
Step 1: Compact support. Let P; : R? — R? be the projection onto B,(0) = 
{x ER? : ||x|| < L} and let u € .%. Then 


WPu Peta) < [lv Pela) Padus) = [lle PeLa) Pauls) 
1 1 
<f o Paus PODA < a, 
lxl>Z FÈL) Jixl>L (L) 

and this vanishes as L — œ. 

Step 2: n-Point measures. Let n = N (£; BL (0)) be the covering number of the 
Euclidean ball in R¢. There exists a set x1,...,Xn € R? such that Bz (0) C UBe(x;). 
If u € Y,(Bz(0)), there exists a measure U, supported on the x;’s and such that 


Wp(U, Hn) < E. 
Indeed let C1 = Be (x1), Ci = Be (xi) \Uj<iBe(x;) and define u,({x;}) = u (Ci). The 
transport map defined by t(x) = x; for x € C; pushes u forward to Un and 


WB nd) <È | e-un < $ern) =e" 


According to Rogers [114], we have the bound 
n< eðalL/e]f, 04 =d[5+logd + loglogd], 


whenever € < L/d. 
Step 3: Common weights. If u = Yai6,, and v = Yd; 6,,, then WP (u,v) < 
Le lak — bg| sup; ; ||xi —x;||?- Let 


Hne 5 = È ax 5x, : ax € {0,8,28,...,[1/818}; F ak = 7 l 
k=l 


This set contains fewer than (2+ 1/65)"~! elements, and any measure supported on 
{x1,...,Xn} can be approximated by a measure in Up ¢ 5 with error 2L(nd)!/?. 

Step 4: Conclusion. Let L = f—!(1/e?), n = N(€;Bz(0)) and ô = [e/(2L)]?/n. 
Combining the previous three steps, we obtain in the case L > ed that 


eðalL/e]? 


p+d eðalL/e]? p+d 
2+ (=) re <) 204 , 


N(3e€) < (2+1/6)""' < 
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because L/e > 1 and 6, > 5. Conclude that 


NPH 34e04[L/e]4 
N(e) < [pr (=) 204 


2.3 The Tangent Bundle 


Although the Wasserstein space %,(.2°) is non-linear in terms of measures, it is 
linear in terms of maps. Indeed, if u E H,(2) and T;: 2 — X are such that 
\|Ti|| € Lp(u), then (aT; + BIh)#u E W(X) for all a, B € R. Later, in Sect. 2.4, 
we shall see that YW, (2°) is in fact homeomorphic to a subset of the space of such 
functions. The goal of this section is to exploit the linearity of the latter in order to 
define the tangent bundle of Y,. This in particular will be used for deriving differ- 
entiability properties of the Wasserstein distance in Sect. 3.1.6. However, the latter 
can be understood at a purely analytic level, and readers uncomfortable with differ- 
ential geometry can access most of the rest of the monograph without reference to 
this section. 

We assume here that 2° is a Hilbert space and that p = 2; the results extend to 
any p > 1. Absolutely continuous measures are assumed to be so with respect to 
Lebesgue measure if .2° = R@ and otherwise refer to Definition 1.6.4. 


2.3.1 Geodesics, the Log Map and the Exponential Map in 7y( X) 


Let y E€ W(X ) be absolutely continuous and u € W(.2’) arbitrary. From Sect. 1.6.1, 
we know that there exists a unique solution to the Monge—Kantorovich problem, 
and that solution is given by a transport map that we denote by ty Recalling that 
i: X — X is the identity map, we can define a curve 


% = [i+ e(t% —i)] #7, t € [0,1]. 


This curve is known as McCann’s [93] interpolant. As hinted in the introduction to 
this section, it is constructed via classical linear interpolation of the transport maps 
t and the identity. Clearly % = y, yı = u and from (2.3), 


W2(%,7) < yh, [v(t —)]? dy = tWo(7, 4); 


mam) sf f, [0-90 a) av= -watu 


It follows from the triangle inequality in %⁄ that these inequalities must hold as 
equalities. Taking this one step further, we see that 
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Wo(%sYs) =(t—s)WalyH),  OSs<t<l. 


In other words, McCann’s interpolant is a constant-speed geodesic in P(X). 
In view of this, it seems reasonable to define the tangent space of H2( 2%) at u 
as (Ambrosio et al. [12, Definition 8.5.1]) 


Tan, = {t(t— i) : t = ty for some v E€ H(2);t > oy 2, 
It follows from the definition that Tan, C Lz (1). (Strictly speaking, Tan, is a subset 
of the space of functions f : 2° > Z such that || f|| € L2() rather than L2 (pL) itself, 
as in Definition 2.4.3, but we will write Ly for simplicity.) 
Although not obvious from the definition, this is a linear space. The reason is 
that, in Rf, Lipschitz functions are dense in L2(4), and for t Lipschitz the negative 
of a tangent element 


-1(t-i)=s(s—i), s > tlitlltip, s=i+“(i—t) 


lies in the tangent space, since s can be seen to belong to the subgradient of a con- 
vex function by definition of s. This also shows that Tan, can be seen to be the 
Lz(u)-closure of all gradients of C7 functions. We refer to [12, Definition 8.4.1 and 
Theorem 8.5.1] for the proof and extensions to other values of p > 1 and to infinite 
dimensions, using cylindrical functions that depend on finitely many coordinates 
[12, Definition 5.1.11]. The alternative definition highlights that it is essentially the 
inner product in Tan, but not the elements themselves, that depends on 4. 

The tangent space definition is valid for arbitrary measures in W(2’). The ex- 
ponential map at y E€ W(.2’) is the restriction to Tany of the mapping that sends 
r € L(y) to [r+ i}#y E€ A( 2). More explicitly, exp, : Tany —> % takes the form 


exp,(t(t—i)) = exp (t+ (1 —1)i] -i) = [t+ (1 —t)ij#y (eR). 


Thus, when y is absolutely continuous, exp, is surjective, as can be seen from its 
right inverse, the log map 


log, : #2 — Tany log,(u) = ty —i, 
defined throughout VW (by virtue of Theorem 1.6.2). In symbols, 
exp,(log,(u))=H, HEWs, and log,(exp,(t(t—i)))=r(t-i) (t€ [0,1]), 


because convex combinations of optimal maps are optimal maps as well. In particu- 
lar, McCann’s interpolant li + (ty — i)] #y is mapped bijectively to the line segment 
t(ty —i) € Tan, through the log map. 

It is also worth mentioning that McCann’s interpolant can also be defined as 


[tpo+(1—t)pij#a, = pi (x,y) =x, p2(x,y) =Y, 
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where p1, p2 : X? > X are projections and 7 is any optimal transport plan between 
y and u. This is defined for arbitrary measures y, u E€ W, and reduces to the previous 
definition if y is absolutely continuous. It is shown in Ambrosio et al. [12, Chapter 
7] or Santambrogio [119, Proposition 5.32] that these are the only constant-speed 
geodesics in ~. 


2.3.2 Curvature and Compatibility of Measures 


Let y, U,V E #2(2% ) be absolutely continuous measures. Then by (2.3) 
Wy (U,V) < [le (x) — ty (x)||? dy(x) = || log, (uw) —log,(v)||?. 


In other words, the distance between u and v is smaller in (2) than the distance 
between the corresponding vectors log,(u) and log,(v) in the tangent space Tany. 
In the terminology of differential geometry, this means that the Wasserstein space 
has nonnegative sectional curvature at any absolutely continuous y. 

It is instructive to see when equality holds. As t% = (ty) ' a change of variables 
gives 


wiv) < f IEG) —a1Pav(a), 


Since the map ty ot% pushes forward v to u, equality holds if and only if t oth =t}. 
This motivates the following definition. 


Definition 2.3.1 (Compatible Measures) A collection of absolutely continuous 
measures © C W( 2) is compatible if for all y,u,v € ©, we have ty ot} = th 
(in L2(v)). 


Remark 2.3.2 The absolute continuity is not necessary and was introduced for no- 
tational simplicity. A more general definition that applies to general measures is the 
following: every finite subcollection of € admits an optimal multicoupling whose 
relevant projections are simultaneously pairwise optimal; see the paragraph pre- 
ceding Theorem 3.1.9. 


A collection of two (absolutely continuous) measures is always compatible. More 
interestingly, if X = R, then the entire collection of absolutely continuous (or even 
just continuous) measures is compatible. This is because of the simple geometry of 
convex functions in R: gradients of convex functions are nondecreasing, and this 
property is stable under composition. In a more probabilistic way of thinking, one 
can always push-forward u to v via the uniform distribution Leb| [0,1] (see Sect. 1.5). 
Letting Fe and F; ! denote the quantile functions, we have seen that 


Wo(u,Vv) = Fu — F; '||z3(0,1): 
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(As a matter of fact, in this specific case, the equality holds for all p > 1 and not 
only for p = 2.) In other words, u > Hy? is an isometry from W%(R) to the subset of 
L(0, 1) formed by (equivalence classes of) left-continuous nondecreasing functions 
on (0,1). Since this is a convex subset of a Hilbert space, this property provides a 
very simple way to evaluate Fréchet means in #7(R) (see Sect. 3.1). If y= Leb|jo,1), 
then ie = i for all u, so we can write the above equality as 
W3 (u,v) = [Fu — Fy 'laa(o.1) = lllogy(u) —log,(v)||?, 

so that if 2 =R, the Wasserstein space is essentially flat (has zero sectional curva- 
ture). 

The importance of compatibility can be seen as mimicking the simple one- 
dimensional case in terms of a Hilbert space embedding. Let € C #(.2’) be com- 
patible and fix y € @. Then for all u,v € G 


w? (u,v) = [le (x) — ty (x)|[?dy(x) = |[]og,(u) — logy(v)Ilz,(y)- 


Consequently, once again, U +> t is an isometric embedding of @ into L(y). Gen- 
eralising the one-dimensional case, we shall see that this allows for easy calculations 
of Fréchet means by means of averaging transport maps (Theorem 3.1.9). 

Example: Gaussian compatible measures. The Gaussian case presented in 
Sect. 1.6.3 is helpful in shedding light on the structure imposed by the compati- 
bility condition. Let y € W(R“) be a standard Gaussian distribution with identity 
covariance matrix. Let X, denote the covariance matrix of a measure u € W#a (R). 
When u and v are centred nondegenerate Gaussian measures, 


vas e 


y = 3/2, tY _ 5 ts EE 


so that y, u, and v are compatible if and only if 
_ Y _ yl/2y-1/2 
tu =tyot, = Ey Ep . 


Since the matrix on the left-hand side must be symmetric, it must necessarily be 
that a) * and Eu 1/2 commute (if A and B are symmetric, then AB is symmetric if 
and only if AB = BA), or equivalently, if and only if Xy and 2,, commute. We see 
that a collection @ of Gaussian measures on R that includes the standard Gaussian 
distribution is compatible if and only if all the covariance matrices of the measures 
in @ are simultaneously diagonalisable. In other words, there exists an orthogonal 
matrix U such that D, = UXU" is diagonal for all u € @. In that case, formula 


(1.6) 


WRU, v) = tr[Eu + Ev — AEK EEA] = PFE ST rl ral 
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simplifies to 


d 
W2 (u,v) = tr[Dy +Dy — 2D, Dy] = ¥ (Voi VB), 04 = [Dalis Bi = [Dv]u, 
i=l 


and identifying the (nonnegative) number a € R with the map x +> ax on R, the 
optimal maps take the “orthogonal separable” form 


ty = 2y7 I UD Da U = Uo (VBi/ea,..-,/Balea) eU 


In other words, up to an orthogonal change of coordinates, the optimal maps take the 
form of d nondecreasing real-valued functions. This is yet another crystallisation of 
the one-dimensional-like structure of compatible measures. 

With the intuition of the Gaussian case at our disposal, we can discuss a more 
general case. Suppose that the optimal maps are continuously differentiable. Then 
differentiating the equation t}, = t} © A gives 


Vt} (x) = Vt} (th (x)) Vti (x). 


Since optimal maps are gradients of convex functions, their derivatives must be 
symmetric and positive semidefinite matrices. A product of such matrices stays sym- 
metric if and only if they commute, so in this differentiable setting, compatibility is 
equivalent to commutativity of the matrices Vt, (tl (x)) and Vt}, (x) for u-almost all 
x. In the Gaussian case, the optimal maps are linear functions, so x does not appear 
in the matrices. 

Here are some examples of compatible measures. It will be convenient to de- 
scribe them using the optimal maps from a reference measure y € W%(IR“). Define 
€ = t#y with t belonging to one of the following families. The first imposes the 
one-dimensional structure by varying only the behaviour of the norm of x, while the 
second allows for separation of variables that splits the d-dimensional problem into 
d one-dimensional ones. 

Radial transformations. Consider the collection of functions t : R — R? of the 
form t(x) = xG(||x||) with G : R+ — R differentiable. Then a straightforward calcu- 
lation shows that 


Vt(x) = G(x + [C (fall) /Tlal] 22°. 


Since both J and xx are positive semidefinite, the above matrix is so if both G 
and G’ are nonnegative. If s(x) = xA(||x||) is a function of the same form, then 
s(t(x)) = xG(||x||)A(||x||G(||x||)) which belongs to that family of functions (since 
G is nonnegative). Clearly 


Vs(t(x)) = H [Ilx G(x] + [GC Lxl])A” (xl G(Lxl1))/le xx 


commutes with Vt(x), since both matrices are of the form al + bxx with a,b scalars 
(that depend on x). In order to be able to change the base measure y, we need to 
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check that the inverses belong to the family. But if y = t(x), then x = ay for some 
scalar a that solves the equation 


aG(ally||) = 1. 


Such a is guaranteed to be unique if a > aG(a) is strictly increasing and it will exist 
(for y in the range of t) if it is continuous. As a matter of fact, since the eigenvalues 
of Vt(x) are G(a) and 


G(a)+G'(a)a=(aG(a))', a= lx), 


the condition that a+ aG(a) is strictly increasing is sufficient (this is weaker than 
G itself increasing). Finally, differentiability of G is not required, so it is enough if 
G is continuous and aG(a) is strictly increasing. 

Separable variables. Consider the collection of functions t : R — R¢ of the form 


t(x1,...,x¢) = (Ti(x1),---, Ta(xa)), T:R>R, (2.8) 


with T; continuous and strictly increasing. This is a generalisation of the compatible 
Gaussian case discussed above in which all the 7;’s were linear. Here, it is obvious 
that elements in this family are optimal maps and that the family is closed under 
inverses and composition, so that compatibility follows immediately. 

This family is characterised by measures having a common dependence structure. 
More precisely, we say that C : [0, 1]¢ — [0,1] is a copula if C is (the restriction of) 
a distribution function of a random vector having uniform margins. In other words, 
if there is a random vector V = (Vi,...,Va) with P(V; < a) =a for all a € [0, 1] and 
all 7 =1,...,d, and 


P(Vi <v,...,Va < va) =C(1,---, va), ui € (0, 1]. 


Nelsen [97] provides an overview on copulae. To any d-dimensional probability 
measure u, one can assign a copula C = C, in terms of the distribution function G 
of u and its marginals G; as 


G(ay,.--,4a) = U((—, a1] x +++ x (9, ag]) = C(Gi(a1),---;Ga(aa)). 


If each G; is surjective on (0, 1), which is equivalent to it being continuous, then this 
equation defines C uniquely on (0, 1)“, and consequently on (0, 1]¢. If some marginal 
Gj is not continuous, then uniqueness is lost, but C still exists [97, Chapter 2]. 
The connection of copulae to compatibility becomes clear in the following lemma, 
proven on page 51 in the supplement. 


Lemma 2.3.3 (Compatibility and Copulae) The copulae associated with abso- 
lutely continuous measures u,v € W(R¢) are equal if and only if tu takes the 
separable form (2.8). 


Composition with linear functions. If @ : R¢ — R is convex with gradient t and A is a 
d x d matrix, then the gradient of the convex function x> @ (Ax) atx is t4 = A't(Ax). 
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Suppose y is another convex function with gradient s and that compatibility holds, 
i.e., Vs(t(x)) commutes with Vt(x) for all x. Then in order for 


Vsa (ta(x)) =A'Vs(AA't(Ax))A and — Vta(x) = A'Vt(Ax)A 


to commute, it suffices that AA’ = J, i.e., that A be orthogonal. Consequently, if 
{t#u}ter are compatible, then so are {ty#uU}ter for any orthogonal matrix U. 


2.4 Random Measures in Wasserstein Space 


Let u be a fixed absolutely continuous probability measure in A(X). If A € 
W(X) is another probability measure, then the transport map ta and the convex 
potential are functions of A. If A is now random, then we would like to be able 
to make probability statements about them. To this end, it needs to be shown that 
ta and the convex potential are measurable functions of A. The goal of this sec- 
tion is to develop a rigorous mathematical framework that justifies such probability 
statements. We show that all the relevant quantities are indeed measurable, and in 
particular establish a Fubini-type result in Proposition 2.4.9. This technical section 
may be skipped at first reading. 

Here is an example of a measurability result (Villani [125, Corollary 5.22]). Re- 
call that P(X ) is the space of Borel probability measures on 2, endowed with the 
topology of weak convergence that makes it a metric space. Let 2% be a complete 
separable metric space and c : Z? — R4 a continuous cost function. Let (Q, F ,P) 
be a probability space and A, « : Q — P(X) be measurable maps. Then there ex- 
ists a measurable selection of optimal transference plans. That is, a measurable 
T: Q — P(X?) such that z(@) € TI(A (œ), K(@)) is optimal for all @ € Q. 

Although this result is very general, it only provides information about m. If 7 
is induced from a map T, it is not obvious how to construct T from m in a mea- 
surable way; we will therefore follow a different path. In order to (almost) have a 
self-contained exposition, we work in a somewhat simplified setting that neverthe- 
less suffices for the sequel. At least in the Euclidean case 2° = Rf, more general 
measurability results in the flavour of this section can be found in Fontbona et al. 
[53]. On the other hand, we will not need to appeal to abstract measurable selection 
theorems as in [53, 125]. 


2.4.1 Measurability of Measures and of Optimal Maps 


Let 2 be a separable Banach space. (Most of the results below hold for any com- 
plete separable metric space but we will avoid this generality for brevity and simpler 
notation). The Wasserstein space Y,(.2°) is a metric space for any p > 1. We can 
thus define: 
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Definition 2.4.1 (Random Measure) A random measure A is any measurable map 
from a probability space (Q, F ,P) to W,( X), endowed with its Borel o-algebra. 


In what follows, whenever we call something random, we mean that it is measurable 
as a map from some generic unspecified probability space. 


Lemma 2.4.2 A random measure A is measurable if and only if it is measurable 
with respect to the induced weak topology. 


Since both topologies are Polish, this follows from abstract measure-theoretic results 
(Fremlin [57, Paragraph 423F]). We give an elementary proof on page 53 of the 
supplement. 

Optimal maps are functions from Æ to itself. In order to define random optimal 
maps, we need to define a topology and a o-algebra on the space of such functions. 


Definition 2.4.3 (The Space -2,(u)) Let X be a Banach space and u a Borel 
measure on X. Then the space &,({) is the space of measurable functions 
f: X —> & such that 


Iflg = (fe 


When 2 is separable, %,(W) is an example of a Bochner space, though we will 
not use this terminology. 

It follows from the definition that ||f||_y,(.) is the Lp norm of the map x ++ 
\|f(x)||.a from X to R: 


1/p 
A aua) < o. 


llega = IF 


As usual we identify functions that coincide almost everywhere. Clearly, -2,(4) is a 
normed vector space. It enjoys another property shared by Lp spaces—completeness: 


VA DAME 


Theorem 2.4.4 (Riesz—Fischer) The space -2,(u) is a Banach space. 


The proof, a simple variant of the classical one, is given on page 53 of the supple- 
ment. 
Random maps lead naturally to random measures: 


Lemma 2.4.5 (Push-Forward with Random Maps) Let u € ⁄ (X ) and let t be 
a random map in (u). Then A = t#u is a continuous mapping from 2,(U) to 
W,( 2), hence a random measure. 


Proof. That A takes values in Y, follows from a change of variables 


f eaw f O = tlg <=. 


Since W, (t#u,s#u) < || |t- slæ lzy) = llt- sll. u) (see (2.3), A is a continu- 
ous (in fact, 1-Lipschitz) function of t. 


Conversely, t is a continuous function of A: 
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Lemma 2.4.6 (Measurability of Transport Maps) Let A be a random measure in 
D(X) and let u E W(X ) such that (i,t? )#u is the unique optimal coupling of 
u and A. Then A +> ti is a continuous mapping from W(X ) to H (u), so tà 
is a random element in Z%,(1). In particular, the result holds if X is a separable 
Hilbert space, p > 1, and u is absolutely continuous. 


Proof. This result is more subtle than Lemma 2.4.5, since A +> ti is not necessarily 
Lipschitz. We give here a self-contained proof for the Euclidean case with quadratic 
cost and u absolutely continuous. The general case builds on Villani [125, Corol- 
lary 5.23] and is given on page 54 of the supplement. 

Suppose that A, > A in W(IR%) and fix £ > 0. For any S C RI, 


An AJj2 pe n AJj2 An A \\2 
le th gay = fier -tilan + fie tilê du. 


Since ||a— b||? < 2?||a||? +2? ||b| 


P the last integral is no larger than 
af ta +f tila =4 | 2 dAn(x) +4 2dA(x). 
Lgl Pansa fea fang PMO aug PAA 


Since (A,,) and A are tight in the Wasserstein space, they must satisfy the absolute 
uniform continuity (2.7). Let 6 = ôs as in (2.7), and notice that by the measure 
preserving property of the optimal maps, the last two integrals are taken on sets of 
measures 1 — u (S). Since u is absolutely continuous, we can find a compact set S 
of u-measure at least 1 — 6 and on which Proposition 1.7.11 applies (see Corol- 
lary 1.7.12), yielding 


[ier — ef au < e-o n>, 


so that 
lim sup It" — th lac) < 8e, 


and this completes the proof upon letting € — 0. 


In Proposition 5.3.7, we show under some conditions that ta || Au) is a continuous 
function of u. 


2.4.2 Random Optimal Maps and Fubini’s Theorem 


From now on, we assume that 2 is a separable Hilbert space and that p = 2. The 
results can most likely be generalised to all p > 1 (see Ambrosio et al. [12, Sec- 
tion 10.2]), but we restrict to the quadratic case for simplicity. 
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Theorem 3.2.13 below requires the application of Fubini’s theorem in the form 


7 A e. 40 . = ny A . 40 . E ~À . 40 . 
f (t-i i) d= f a(t, ~i,t8, -i) ag = | ( stg, — 1,8, i) dp. 


In order for this to even make sense, we need to have a meaning for “expectation” in 
the spaces %(09) and L2(69), both of which are Banach spaces. There are several 
(nonequivalent) definitions for integrals in such spaces (Hildebrant [69]); the one 
which will be the most convenient for our needs is the Bochner integral. 


Definition 2.4.7 (Bochner Integral) Let B be a Banach space and let f : (Q, F, 
P) > B be a simple random element taking values in B: 


f(@) => fl{@e Qi}, QE F,  fjEB. 


Then the Bochner integral (or expectation) of f is defined by 


n 


of = > P(Q;)fj € B. 


j=l 


If f is measurable and there exists a sequence fa of simple random elements such 
that || fn — f|| —> 0 almost surely and E|| fa — f|| — 0, then the Bochner integral of f 
is defined as the limit 


Ef = limEfy. 


neo 


The space of functions for which the Bochner integral is defined is the Bochner 
space Lı (Q;B), but we will use neither this terminology nor the notation. It is not 
difficult to see that Bochner integrals are well-defined: the expectations do not de- 
pend on the representation of the simple functions nor on the approximating se- 
quence, and the limit exists in B (because it is complete). More on Bochner integrals 
can be found in Hsing and Eubank [71, Section 2.6] or Dunford et al. [48, Chap- 
ter II.6]. A major difference from the real case is that there is no clear notion of 
“infinity” here: the Bochner integral is always an element of B, whereas expecta- 
tions of real-valued random variables can be defined in R U {+}. It turns out that 
separability is quite important in this setting: 


Lemma 2.4.8 (Approximation of Separable Functions) Let f : Q — B be mea- 
surable. Then there exists a sequence of simple functions fa such that \\f,(@) — 
f(@)|| = 0 for almost all œ if and only if f(Q\ N) is separable for some N CQ 
of probability zero. In that case, fa can be chosen so that || fn(@)|| < 2|| f (@)|| for 
all @ € Q. 
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A proof can be found in [48, Lemma III.6.9], or on page 55 of the supplement. Func- 
tions satisfying this approximation condition are sometimes called strongly measur- 
able or Bochner measurable. In view of the lemma, we will call them separately 
valued, since this is the condition that will need to be checked in order to define 
their integrals. 

Two remarks are in order. Firstly, if B itself is separable, then f(Q) will obvi- 
ously be separable. Secondly, the set 4” C Q \ M on which (gn, ) does not converge 
to f may fail to be measurable, but must have outer probability zero (it is included 
in a measurable set of measure zero) [48, Lemma III.6.9]. This can be remedied by 
assuming that the probability space (Q, F ,P) is complete. It will not, however, be 
necessary to do so, since this measurability issue will not alter the Bochner expec- 
tation of f. 


Proposition 2.4.9 (Fubini for Optimal Maps) Let A be a random measure in 
W( 2) such that EW2(d9,A) < œ and let 09,0 E P(X) such that tp, and t 
exist (and are unique) with probability one. (For example, if 09 is absolutely contin- 
uous.) Then 


; A 40 : ./,A +40 ş 7 - 40 f 
aff (t —i t -i} dO =], a(t —i, tg, ~i) d6 sa EtG, —i to -i} dp. 


(2.9) 


This holds by linearity when A is a simple random measure. The general case fol- 
lows by approximation: the Wasserstein space is separable and so is the space of 
optimal maps, by Lemma 2.4.6, so we may apply Lemma 2.4.8 and approximate tg, 
by simple maps for which the equality holds by linearity. On page 56 of the sup- 
plement, we show that these simple maps can be assumed optimal, and give the full 
details. 


2.5 Bibliographical Notes 


Our proof of Theorem 2.2.11 borrows heavily from Bolley et al. [29]. A similar 
result was obtained by Kloeckner [81], who also provides a lower bound of a similar 
order. 

The origins of Sect. 2.3 can be traced back to the seminal work of Jordan et al. 
[74], who interpret the Fokker—Planck equation as a gradient flow (where function- 
als defined on W can be differentiated) with respect to the 2-Wasserstein metric. 
The Riemannian interpretation was (formally) introduced by Otto [99], and rigor- 
ously established by Ambrosio et al. [12] and others; see Villani [125, Chapter 15] 
for further bibliography and more details. 

Compatible measures (Definition 2.3.1) were implicitly introduced by Boissard 
et al. [28] in the context of admissible optimal maps where one defines families 
of gradients of convex functions (7;) such that i o T; is a gradient of a convex 
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function for any i and j. For (any) fixed measure y € @, compatibility of @ is then 
equivalent to admissibility of the collection of maps {ty ueg. The examples we 
gave are also taken from [28]. 

Lemma 2.3.3 is from Cuesta-Albertos et al. [38, Theorem 2.9] (see also Zemel 
and Panaretos [135]). 
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Chapter 3 A 
Fréchet Means in the Wasserstein Shep for 
Space %2 


If H is a Hilbert space (or a closed convex subspace thereof) and x1,...,xy € H, 
then the empirical mean xy = N7! $x; is the unique element of H that minimises 
the sum of squared distances from the x;’s.! That is, if we define 


N 
F(0)= J |0 -x| OH, 
i=l 


then 0 = Xy is the unique minimiser of F. This is easily seen by “opening the 
squares” and writing 
F(0) = F (žy) +N||@ —Xy||°. 


The concept of a Fréchet mean (Fréchet [55]) generalises the notion of mean to a 
more general metric space by replacing the usual “sum of squares” with a “sum 
of squared distances”, giving rise to the so-called Fréchet functional. A closely re- 
lated notion is that of a Karcher mean (Karcher [78]), a term that describes station- 
ary points of the sum of squares functional, when the latter is differentiable (see 
Sect. 3.1.6). Population versions of Fréchet means, assuming the space is endowed 
with a probability law, can also be defined, replacing summation by expectation with 
respect to that law. 


Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/ 
978-3-030-38438-8_3) contains supplementary material. 


' Tt should be remarked that this is a Hilbertian property (or at least a property linked to an inner 
product), not merely a linear property. In other words, it does not extend to Banach spaces. As an 
example, let H = R? with the L; norm and consider the vertices (0,0), (0,1), and (1,0) of the unit 
simplex. The mean of these is (1/3, 1/3) but for (x,y) in the triangle, 


F(x,y) = (x+y)? + (x+1—y)?+(1—xty)? =24+27 +9? +(x-y)? 


is minimised at (0,0). 
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Fréchet means are perhaps the most basic object of statistical interest, and this 
chapter studies such means when the underlying space is the Wasserstein space W. 
In general, existence and uniqueness of a Fréchet mean can be subtle, but we will 
see that the nature of optimal transport allows for rather clean statements in the case 
of Wasserstein space. 


3.1 Empirical Fréchet Means in 7 


3.1.1 The Fréchet Functional 


As foretold in the preceding paragraph, the definition of a Fréchet mean requires the 
definition of an appropriate sum-of-squares functional, the Fréchet functional: 


Definition 3.1.1 (Empirical Fréchet Functional and Mean) The Fréchet functi- 


onal associated with measures U',..., uN € P(X) is 
Lee oi 
F:W(Z)>9R F= Lu) VEPZ) GD 


i=1 
A Fréchet mean of (u',...,) is a minimiser of F in Ws( 2) (if it exists). 


In analysis, a Fréchet mean is often called a barycentre. We shall use the terminol- 
ogy of “Fréchet mean” that is arguably more popular in statistics.” 

The factor | /(2N) is irrelevant for the definition of Fréchet mean. It is introduced 
in order to have simpler expressions for the derivatives (Theorems 3.1.14 and 3.2.13) 
and to be compatible with a population version EW; (y,A)/2 (see (3.3)). 

The first reference that deals with empirical Fréchet means in %4 (R?) is the sem- 
inal paper of Agueh and Carlier [2]. They treat the more general weighted Fréchet 
functional 


1 N ; N 
F(Y) = 5 Z wiW3 (nH), 0<w, Ywi=l, 
i=l i=l 


but, for simplicity, we shall focus on the case of equal weights. (If all the w,’s are 
rational, then the weighted functional can be encompassed in (3.1) by taking some 
of the p1'’s to be the same. The case of irrational w;’s is then treated with continuity 
arguments. Moreover, (3.3) encapsulates (3.1) as well as the weighted version when 
A can take finitely many values.) 


2 Interestingly, Fréchet himself [56] considered the Wasserstein metric between probability mea- 
sures on R, and some refer to this as the Fréchet distance (e.g., Dowson and Landau [44]), which 
is another reason to use this terminology. 


3.1 Empirical Fréchet Means in ~% 6l 


3.1.2 Multimarginal Formulation, Existence, and Continuity 


In [60], Gangbo and Święch consider the following multimarginal Monge- 
Kantorovich problem. Let u!,..., u~ be N measures in W(.2°) and let TT(u!,..., 
U”) be the set of probability measures in 2" having {u'}"_; as marginals. The 
problem is to minimise 


1 
G(x) = an Jav È lail dma) over we M(u!,...,n"). 


The factor 1/(2N7) is of course irrelevant for the minimisation and its purpose will 
be clarified shortly. If N = 2, we obtain the Kantorovich problem with quadratic 
cost. The probabilistic interpretation (as in Sect. 1.2) is that one is given random 
variables X,,...,Xj with marginal probability laws u',...,u% and seeks to con- 


struct a random vector Y = (¥;,...,Yy) on 2% such that X; 4 Y; and 


1 1 
aED IIe - Yl? < S GED lZ- zN. 
2N2 ig 2N2 if 
for any other random vector Z = (Z,,...,Zjy) such that X; 2 Z;. Intuitively, we seek 


a random vector with prescribed marginals but maximally correlated entries. 

We refer to elements of TI (u!,..., u^) (equivalently, joint laws of X4, ... , Xy) as 
multicouplings (of u!,...,u™). Just like in the Kantorovich problem, there always 
exists an optimal multicoupling 7. 

Let us now show how the multimarginal problem is equivalent to the problem 
of finding the Fréchet mean of !,...,u. The first thing to observe is that the 
objective function can be written as 


Ai Ln iia M(x) =M Sr 
(=) = fay Dl Camas. AND) MS era pp 


The next result shows that the Fréchet mean and the multicoupling problems are 
essentially the same. 


Proposition 3.1.2 (Fréchet Means and Multicouplings) Let u!,...,u’ <¢W(2’). 
Then u is a Fréchet mean of (u',...,u%) if and only if there exists an optimal 
multicoupling n € W(X) of (u',...,u%) such that u = M#r, and furthermore 


F(u) = G(m). 


Proof. Let 7 be an arbitrary multicoupling of (u!,...,u) and set u = M#7. Then 
(x ++ x;,M)#z is a coupling of u’ and u, and therefore 


[p -MPa > W? 1). 


Summation over i gives F(u) < G(r) and so inf F < infG. 
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For the other inequality, let u E€ Y (Z) be arbitrary. For each i, let zt be an 
optimal coupling between u and u’. Invoking the gluing lemma (Ambrosio and 
Gigli [10, Lemma 2.1]), we may glue all z'’s using their common marginal u. This 
procedure constructs a measure 7 on .2°*! with marginals u1,..., Uy, u and its 
relevant projection 7 is then a multicoupling of U1,..., Un. 

Since 2 is a Hilbert space, the minimiser of y > ¥ ||x; — y||? is y = M(x). Thus 


2 E: 
H= fya È lane) TOE x)|?dn (x,y) =G(n). 


In particular, inf F > inf G and combining this with the established converse inequal- 
ity we see that inf F = infG. Observe also that the last displayed inequality holds as 
equality if and only if y = M(x) n-almost surely, in which case u = M#z. Therefore, 
if u does not equal M#z, then F(u) > G(x) > F(M#r), and u cannot be optimal. 
Finally, if 7 is optimal, then 


F(M#r) < G(x) = infG = inf F 
establishing optimality of u = M#z and completing the proof. 


Since optimal couplings exist, we deduce that so do Fréchet means. 


Corollary 3.1.3 (Fréchet Means and Moments) Any finite collection of measures 
u',...,u% € H( 2) admits a Fréchet mean u, for all p > 1 


[leslie ance < FÈ f Ilario) 


and when p > | equality holds if and only if u! = -> = u”. 


Proof. Let x be a multicoupling of u',...,u" such that u = My#r (Proposi- 
tion 3.1.2). Then 


1A | 

[lean = fayn 
i baa 

= yd J balan’) 


The statement about equality follows from strict convexity of x +> ||x||? if p > 1. 


1 N 
aa ||P 
< FÈ fpe ant) 


A further corollary of Proposition 3.1.2 is a bound on the support: 


Corollary 3.1.4 The support of any Fréchet mean is included in the set 


1 N N 
suppu +--+: +suppH Xp XN i i 
7 = { N : Xi € suppu \ C conv (Ù suppu ) ; 
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In particular, if all the '’s are supported on a common convex set K, then so is any 
of their Fréchet means. 


The multimarginal formulation also yields a continuity property for the empirical 
Fréchet mean. Conditions for uniqueness will be given in the next subsection. 


Theorem 3.1.5 (Continuity of Fréchet Means) Suppose that Wz(u}, u’) > 0 for 
i=1,...,N and let T, denote any Fréchet mean of (Uy ,..-, Uj’). Then (T) stays in 
a compact set of W(X), and any limit point is a Fréchet mean of (u',...,L%). 


In particular, if u!,...,uY have a unique Fréchet mean ŢI, then T, — H in (X). 


Proof. We sketch the steps of the proof here, with the full details given on page 63 
of the supplement. 

Step 1: tightness of (1,). This is true because the collection of multicouplings is 
tight, and the mean function M is continuous. 

Step 2: weak limits are limits in W(.2°). This holds because the mean function 
has linear growth. 

Step 3: the limit is a Fréchet mean of (u!,...,u™). From Corollary 3.1.3, it 
follows that m, must be sought on some fixed bounded set in #(.2). On such 
sets, the Fréchet functionals are uniformly Lipschitz, so their minimisers converge 
as well. 


3.1.3 Uniqueness and Regularity 


A general situation in which Fréchet means are unique is when the Fréchet func- 
tional is strictly convex. In the Wasserstein space, this requires some regularity, 
but weak convexity holds in general. Absolutely continuous measures on infinite- 
dimensional 2% are defined in Definition 1.6.4. 


Proposition 3.1.6 (Convexity of the Fréchet Functional) Let A,y; © A(X) and 
t € [0,1]. Then 


WEEN + (1—1)p,A) < W (M,A) +0 =W (8, A). (3.2) 
When A is absolutely continuous, the inequality is strict unless t € {0,1} or y% = %. 


Remark 3.1.7 The Wasserstein distance is not convex along geodesics. That is, if 
we replace the linear interpolant ty, + (1 —t)% by McCann’s interpolant, then t +> 
W3(%,A) is not necessarily convex (Ambrosio et al. [12, Example 9.1.5]). 


Proof. Let m; € II(7,A) be optimal and notice that the linear interpolant tz, + (1 — 
t)m € H(t + (1 —t)y%,A), so that 


Wem +(-)wA)< f |le—ylPalem +0- m], 


64 3 Fréchet Means in the Wasserstein Space 


which is (3.2). When A is absolutely continuous and t € (0,1), equality in (3.2) 
holds if and only if m = tm + (1 — t)m = qr x i)#4. But 7 is supported 
on the graphs of two functions: t and ie Consequently, equality can hold only if 


these two maps equal A -almost surely, or, equivalently, if 7, = %2. 


As a corollary, we deduce that the Fréchet mean is unique if one of the measures 
u‘ is absolutely continuous, and this extends to the population version (see Proposi- 
tion 3.2.7). 

We conclude this subsection by stating an important regularity property in the 
Euclidean case. See Agueh and Carlier [2, Proposition 5.1] for a proof. 


Proposition 3.1.8 (L..-Regularity of Fréchet Means) Let u!,...,u" c€ %(R2) 
and suppose that u! is absolutely continuous with density bounded by M. Then 
the Fréchet mean of {u7} is absolutely continuous with density bounded by NIM 
and is consequently a Karcher mean. 


In Theorem 5.5.2, we extend Proposition 3.1.8 to the population level. 


3.1.4 The One-Dimensional and the Compatible Case 


When 2 = R, there is a simple expression for the Fréchet mean because #(R) 
can be imbedded in a Hilbert space. Indeed, recall that 


Wo(u,v) = Fu —Fy"Iz,(01) 


(see Sect. 2.3.2 or 1.5). In view of that, W(IR) can be seen as the convex closed 
subset of L2(0,1) formed by equivalence classes of left-continuous nondecreasing 
functions on (0,1): any quantile function is left-continuous and nondecreasing, and 
any such function G can be seen to be the inverse function of the distribution func- 
tion, the right-continuous inverse of G 


F(x) =inf{t € (0,1) : G(t) > x} = sup{t € (0,1) : G(t) < x}. 


(See, for example, Bobkov and Ledoux [25, Appendix A].) Therefore, the Fréchet 
mean of u!,..., u € W(IR) is the measure u having quantile function 


1 1 a 
Fi = yatu 


The Fréchet mean is thus unique. This is no longer true in higher dimension, unless 
some regularity is imposed on the measures (Proposition 3.2.7). 

Boissard et al. [28] noticed that compatibility of u!,...,u™ according to Def- 
inition 2.3.1 allows for a simple solution to the Fréchet mean problem, as in the 
one-dimensional case. Recall from Proposition 3.1.2 that this is equivalent to the 
multimarginal problem. Returning to the original form of G, we obtain an easy lower 
bound for any z € IT(u!,...,u): 
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1 
O) = zya yy Selb ail dn) > se DWP (ui a) 


because the (i, j)-th marginal of is a coupling of 1! and u’. Thus, if equality above 
holds for 2, then 7 is optimal and Min is the Fréchet mean by Proposition 3.1.2. 


N 
This is indeed the case for 7 = (i, t! ti , }#u! because the compatibility gives: 


ere =$ 


Jj 
Joli? ae(ar,....an) = ff lei 28 


=j je oti isle se). 


We may thus conclude, in a slightly more general form (y was u! above): 


Theorem 3.1.9 (Fréchet Mean of Compatible Measures) Suppose that {y,u', 
.., u} are compatible measures. Then 


aže 
— > ty |# 
Ne 


is the Fréchet mean of (u!,...,L). 


A population version is given in Theorem 5.5.3. 


3.1.5 The Agueh—Carlier Characterisation 


Agueh and Carlier [2] provide a useful sufficient condition for y to be the Fréchet 
mean. When 2° = Rf, this condition is also necessary [2, Proposition 3.8], hence 
characterising Fréchet means in R. It will allow us to easily deduce some equiv- 
ariance results for Fréchet means with respect to independence (Lemma 3.1.11) and 
rotations (3.1.12). More importantly, it provides a sufficient condition under which 
a local minimum of F is a global minimum (Theorem 3.1.15) and the same idea 
can be used to relate the population Fréchet mean to the expected value of the opti- 
mal maps (Theorem 4.2.4). Recall that @* denotes the Legendre transform of @, as 
defined on page 14. 


Proposition 3.1.10 (Fréchet Means and Potentials) Let u!,..., UY € A(X) be 
absolutely continuous, let y € W2(& ) and denote by 6; the convex potentials of thi 


Tf ġi = 6;* are such that 
ie La : 
T DAES 5 Ill, EX, with equality y-almost surely, 
=l 


then y is the unique Fréchet mean of u!,... u”. 
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Proof. Uniqueness follows from Proposition 3.2.7. If 0 E€ #(.2) is any measure, 
then the Kantorovich duality yields 


wn =f) (3-a) arot S, (Slit?) aw" 
WOM) > f (3-a) a+ f (3I -00)) wo) 


Summation over i gives the result. 


A population version of this result, based on similar calculations, is given in Theo- 
rem 4.2.4. 

The next two results are formulated in R? because then the converse of Propo- 
sition 3.1.10 is proven to be true. If one could extend [2, Proposition 3.8] to any 
separable Hilbert Z, then the two lemmata below will hold with R? replaced by 
2. The simple proofs are given on page 66 of the supplement. 


Lemma 3.1.11 (Independent Fréchet Means) Let u!,... u and v!,...,v% be 
absolutely continuous measures in W(R%) and W(R®) with Fréchet means u 
and v, respectively. Then the independent coupling u v is the Fréchet mean of 
wav!,..., uv v”. 


By induction (or a straightforward modification of the proof), one can show that the 
Fréchet mean of (uÍ @ vi Q p’) is U@v@p, and so on. 


Lemma 3.1.12 (Rotated Fréchet Means) Jf u is the Fréchet mean of the abso- 
lutely continuous measures U',..., uN and U is orthogonal, then U#u is the Fréchet 
mean of U#u',..., U#UN. 


3.1.6 Differentiability of the Fréchet Functional and Karcher 
Means 


Since we seek to minimise the Fréchet functional F, it would be helpful if F were 
differentiable, because we could then find at least local minima by solving the equa- 
tion F’ = 0. This observation of Karcher [78] leads to the notion of Karcher mean. 


Definition 3.1.13 (Karcher Mean) Let F be a Fréchet functional associated with 
some random measure A in W(X). Then y is a Karcher mean for A if F is differ- 
entiable at y and F'(y) = 0. 


Of course, if y is a Fréchet mean for the random measure A and F is differentiable at 
y, then F’(y) must vanish. In this subsection, we build upon the work of Ambrosio 
et al. [12] and determine the derivative of the Fréchet functional. This will not only 
allow for a simple characterisation of Karcher means in terms of the optimal maps ta 
(Proposition 3.2.14), but will also be the cornerstone of the construction of a steepest 
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descent algorithm for empirical calculation of Fréchet means. The differentiability 
holds at the population level too (Theorem 3.2.13). 

It turns out that the tangent bundle structure described in Sect. 2.3 gives rise to a 
differentiable structure in the Wasserstein space. Fix u? € W(.2) and consider the 
function 


1 
Fo: W(2%)>R,  Foly) = 5 We (7H). 
Ambrosio et al. [12, Corollary 10.2.7] show that when y is absolutely continuous, 


lim es h Ga (+) -x ty () -x) dy(x) 


=0. 
W2(v,7)0 Wr(v,7) 


Parts of the proof of this result (the limit superior above is < 0; the limit inferior 
is bounded below) are reproduced in Proposition 3.2.12. The integral above can be 
seen as the inner product 
0 
(ty -i,ty-i) 


in the space -%2 (y) that includes as a (closed) subspace the tangent space Tany. In 
terms of this inner product and the log map, we can write 


Fa(v) — P(Y) =- (log,(u°),logy(v)) +0(Wa(v.7)), v=>y in Ws, 
so that Fo is Fréchet-differentiable* at y with derivative 


0 
Ky) = —log,(u°) =— (e — i) € Tany. 


By linearity, one immediately obtains: 


Theorem 3.1.14 (Gradient of the Fréchet Functional) Fix a collection of mea- 

sures U!,..., uN E€ P(X). When y E€ W2( X) is absolutely continuous, the Fréchet 

functional 

N : 
WEB), VEP) 


1 
F(Y) = oH 


l 


is Fréchet-differentiable and 


14 i 1 Pe 
F'()=-y oei) = -y D (tr -ì). 


It follows from this that an absolutely continuous y E€ #(.2’) is a Karcher mean if 
and only if the average of the optimal maps is the identity. If in addition one u’ is 
absolutely continuous with bounded density, then the Fréchet mean H is absolutely 


3 The notion of Fréchet derivative is also named after Maurice Fréchet, but is not directly related 
to Fréchet means. 
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continuous by Proposition 3.1.8, so it is a Karcher mean. The result extends to the 
population version; see Proposition 3.2.14. 

It may happen that a collection u!,..., uM of absolutely continuous measures 
have a Karcher mean that is not a Fréchet mean; see Alvarez-Esteban et al. [9, Ex- 
ample 3.1] for an example in R?. But a Karcher mean y is “almost” a Fréchet mean 


in the following sense. By Proposition 3.2.14, N~! EE (x) = x for y-almost all x. 
If, on the other hand, the equality holds for all x € 2, then y is the Fréchet mean by 
taking integrals and applying Proposition 3.1.10. One can hope that under regular- 
ity conditions, the y-almost sure equality can be upgraded to equality everywhere. 
Indeed, this is the case: 


Theorem 3.1.15 (Optimality Criterion for Karcher Means) Let U C R? be an 
open convex set and let u!,...,u™ € (RI) be probability measures on U with 
bounded strictly positive densities g!,..., 8". Suppose that an absolutely continu- 
ous Karcher mean y is supported on U with bounded strictly positive density f there. 
Then y is the Fréchet mean of u!,...,u™ if one of the following holds: 


1. U =R’ and the densities f,g',...,g% are of class C°™ for some a > 0; 
2. U is bounded and the densities f ,g',...,g% are bounded below on U. 


Proof. The result exploits Caffarelli’s regularity theory for Monge-Ampére equa- 
tions in the form of Theorem 1.6.7. In the first case, there exist C ' (in fact, C>%) 


convex potentials @; on R? with t} = VọỌi, so that ty i (x) is a singleton for all x € R!. 
The set {x € Rf : EE (x) /N # x} is y-negligible (and hence Lebesgue negligible) 
and open by continuity. It is therefore empty, so F’(y) = 0 everywhere, and y is the 
Fréchet mean (see the discussion before the theorem). 

In the second case, by the same argument we have X t} (x)/N =x for all x € U. 
Since U is convex, there must exist a constant C such that ¥ g;(x) =C+N)|x||? /2 for 
all x € U, and we may assume without loss of generality that C = 0. If one repeats 
the proof of Proposition 3.1.10, then F (y) < F(@) for all © € P(U). By continuity 
considerations, the inequality holds for all 0 € P(U) (Theorem 2.2.7) and since U 
is closed and convex, y is the Fréchet mean by Corollary 3.1.3. 


3.2 Population Fréchet Means 


In this section, we extend the notion of empirical Fréchet mean to the population 
level, where A is a random element in W%(.2°) (a measurable mapping from a prob- 
ability space to #4(.2% )). This requires a different strategy, since it is not clear how 
to define the analogue of the multicouplings at that level of abstraction. However, it 
is important to point out that when there is more structure in A, multicouplings can 
be defined as laws of stochastic processes; see Pass [102] for a detailed account of 
the problem in this case. 
In analogy with (3.1), we define: 
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Definition 3.2.1 (Population Fréchet Mean) Let A be a random measure in 
W>( 2). The Fréchet mean of A is the minimiser (if it exists and is unique) of the 
Fréchet functional 


F(y)==EW;(7,A), ye W( 2). (3.3) 


Since W is continuous and nonnegative, the expectation is well-defined. 


3.2.1 Existence, Uniqueness, and Continuity 


Existence and uniqueness of Fréchet means on a general metric space M are rather 
delicate questions. Usually, existence proofs are easier: for example, since the 
Fréchet functional F is continuous on M (as we show below), one often invokes 
local compactness of M in order to establish existence of a minimiser. Unfortu- 
nately, a different strategy is needed when M = % (2°), because the Wasserstein 
space is not locally compact (Proposition 2.2.9). 

The first thing to notice is that F is indeed continuous (this is clear for the em- 
pirical version). This is a consequence of the triangle inequality and holds when 
W>( X ) is replaced by any metric space. 


Lemma 3.2.2 (Finiteness of the Fréchet Functional) Jf F is not identically infi- 
nite, then it is finite and locally Lipschitz everywhere on W(X). 


Proof. Assume that F is finite at y. If O is any other measure in #(2), write 


Since x < 1 +x? for all x, the triangle inequality in W(.2’) yields 


2|F (7) — F(8)| < Wa(y, @)[2EW2(y,A) + W2(8,7)] 
< W2(y, 8) [2 DW; (7,A) +2+W2(6,7)]. 


Since F (y) < ©, this shows that F is finite everywhere and the right-hand side van- 
ishes as © > yin #(2 ). Now that we know that F is continuous, the same upper 
bound shows that it is in fact locally Lipschitz. 


Example: let (a,) be a sequence of positive numbers that sum up to one. Let 
Xn = 1/a, and suppose that A equals 6{x,} E€ W(R) with probability a,. Then 


W2 (A, ô) = > anx? = by 1/an = %, 
n=1 n=1 


and by Lemma 3.2.2 F is identically infinite. Henceforth, we say that F is finite 
when the condition in Lemma 3.2.2 holds. 
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Using the lower semicontinuity (2.5), one can prove existence on R¢ rather easily. 
(The empirical means exist even in infinite dimensions by Corollary 3.1.3.) 


Proposition 3.2.3 (Existence of Fréchet Means) The Fréchet functional associ- 
ated with any random measure A in W3(R¢) admits a minimiser. 


Proof. The assertion is clear if F is identically infinite. Otherwise, let (y,) be a 
minimising sequence. We wish to show that the sequence is tight. Define L = 
sup, F (Yu) < œ and observe that since x < 1 +x? for all x € R, 


EW (W}, A) <1+EW?(%,A)<2L4+1, n=1,2,.... 


By the triangle inequality 


L' = EW (0, A) < W2 (ôo, 71) + EW2("1,A) < W2(o, 71) +2L+1 


so that for all n 


1/2 
(fi? ano) ) = W (Yn, 60) < EW2 (n, A) FEW (A, 60) <2L+14L' < œ. 


Since closed and bounded sets in R? are compact, it follows that (%) is a tight 
sequence. We may assume that y% — y weakly, then use (2.5) and Fatou’s lemma to 
obtain 


2F (y) = EW; (y,A) < Eliminf W3 (%, A) < liminfEW3 (%,A) = 2inf F. 


Thus, y is a minimiser of F, and existence is established. 


When 2 is an infinite-dimensional Hilbert space, existence still holds under a com- 
pactness assumption. We first prove a result about the support of the Fréchet mean. 
At the empirical level, one can say more about the support (see Corollary 3.1.4). 


Proposition 3.2.4 (Support of Fréchet Mean) Let A be a random measure in 
W( 2) and let KC X be a convex closed set such that P|A (K) = 1] = 1. If y 
minimises F, then y(K) = 1. 


Remark 3.2.5 For any closed KC X and any  € [0,1], the set {A E P(X): 
A(K) > a} is closed in W,( 2) because {A E P(X): A(K) > a} is weakly closed 
by the portmanteau lemma (Lemma 1.7.1). 


The proof amounts to a simple projection argument; see page 70 in the supplement. 


Corollary 3.2.6 If there exists a compact convex K satisfying the hypothesis of 
Proposition 3.2.4, then the Fréchet functional admits a minimiser supported on K. 


Proof. Proposition 3.2.4 allows us to restrict the domain of F to #(K), the collec- 
tion of probability measures supported on K. Since this set is compact in #4(.2) 
(Corollary 2.2.5), the result follows from continuity of F. 
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From the convexity (3.2), one obtains a simple criterion for uniqueness. See Def- 
inition 1.6.4 for absolute continuity in infinite dimensions. 


Proposition 3.2.7 (Uniqueness of Fréchet Means) Let A be a random measure in 
W>( 2) with finite Fréchet functional. If A is absolutely continuous with positive 
(inner) probability, then the Fréchet mean of A is unique (if it exists). 


Remark 3.2.8 It is not obvious that the set of absolutely continuous measures is 
measurable in W(X). We assume that there exists a Borel set A C W(X) such 
that P(A € A) > 0 and all measures in A are absolutely continuous. 


Proof. By taking expectations in (3.2), one sees that F is convex on #(.2°) with 
respect to linear interpolants. From Proposition 3.1.6, we conclude that 


1 
A absolutely continuous => yr WEA) strictly convex. 


As F was already shown to be weakly convex in any case, it follows that 
P(A absolutely continuous) >0 => F strictly convex. 


Since strictly convex functionals have at most one minimiser, this completes the 
proof. 


We state without proof an important consistency result (Le Gouic and Loubes 
[87, Theorem 3]). Since ⁄4( X) is a complete and separable metric space, we can 
define the “second degree” Wasserstein space #2(#(.2)). The law of a random 
measure A is in %4 (%2( X )) if and only if the corresponding Fréchet functional is 
finite. 


Theorem 3.2.9 (Consistency of Fréchet Means) Let A,,A be random measures 
in W,(R¢) with finite Fréchet functionals and laws P,,P € W2(W(R4)). If Pa + P 
in Ws(Wy(R2)), then any sequence Àn of Fréchet means of An has a W3-limit point 
À, which is a Fréchet mean of A. 


See the Bibliographical Notes for a more general formulation. 


Corollary 3.2.10 (Wasserstein Law of Large Numbers) Let A be a random mea- 
sure in W(R¢) with finite Fréchet functional and let A\,... be a sample from A. 
Assume À is the unique Fréchet mean of A (see Proposition 3.2.7). Then almost 
surely, the sequence of empirical Fréchet means of Aj,...,An converges to À. 


Proof. Let P be the law of A and let P, be its empirical counterpart (a random 
element in W(W(R“)). Like in the proof of Proposition 2.2.6 (with X replaced by 
the complete separable metric space W(R“)), almost surely P,, > P in¥™(M(R4)) 
and Theorem 3.2.9 applies. 

Under a compactness assumption, one can give a direct proof for the law of large 
numbers as in Theorem 3.1.5. This is done on page 71 in the supplement. 
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3.2.2 The One-Dimensional Case 


As a generalisation of the empirical version, we have: 


Theorem 3.2.11 (Fréchet Means in 7%(IR)) Let A be a random measure in W(R) 
with finite Fréchet functional. Then the Fréchet mean of A is the unique measure À 
with quantile function F; 1G) = LF, l(t), t € (0,1). 


Proof. Since L2(0,1) is a Hilbert space, the random element Fy! € L2(0,1) has a 
unique Fréchet mean g € L2(0, 1), defined by the relations (g, f) = E (Fy LP) for 
all f € L2(0,1). On page 72 of the supplement, we show that g can be identified 
with Fy. 


Interestingly, no regularity is needed in order for the Fréchet mean to be unique. 
This is not the case for higher dimensions, see Proposition 3.2.7. If there is some 
regularity, then one can state Theorem 3.2.11 in terms of optimal maps, because Fy i 
is the optimal map from Leb|jq 1; to A. If y € %(R) is any absolutely continuous 
(or even just continuous) measure, then Theorem 3.2.11 can be stated as follows: 
the Fréchet mean of A is the measure [ Oty J#y. A generalisation of this result to 
compatible measures (Definition 2.3.1) can be carried out in the same way, since 
compatible measures are imbedded in a Hilbert space, using the Bochner integrals 
for the definition of the expected optimal maps (see Sect. 2.4). 


3.2.3 Differentiability of the Population Fréchet Functional 


We now use the Fubini result (Proposition 2.4.9) in order to extend the differentiabil- 
ity of the Fréchet functional to the population version. This will follow immediately 
if we can interchange the expectation and the derivative in the form 


TORA shih. 2 
Fy) = EWV (A) =E ( 3W2) (7:4) = -E i) 
In order to do this, we will use dominated convergence in conjunction with uniform 
bounds on the slopes 


0.5W2(0,A)—0.5W2 (00, A)+ fot —i,t8 —i) dO 
u(@,A)=—— Wea) ~ wan 
(3.4) 


Proposition 3.2.12 (Slope Bounds) Let 09, A, and 0 be probability measures with 
Oo absolutely continuous, and set 6 = W2(0, 00). Then 


55 —Wo( 4, A) — /2W2( 0,60) +2W2(4, ô) < u(8,A) < 58, 


NI = 
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where u is defined by (3.4). If the measures are compatible in the sense of Defini- 
tion 2.3.1, then u(0,A) = 6/2. 


The proof is a slight variation of Ambrosio et al. [12, Theorem 10.2.2 and Proposi- 
tion 10.2.6], and the details are given on page 72 of the supplement. 


Theorem 3.2.13 (Population Fréchet Gradient) Let A be a random measure with 
finite Fréchet functional F. Then F is Fréchet-differentiable at any absolutely con- 
tinuous Qp in the Wasserstein space, and F'(@)) = tg, —i € L4(09). More precisely, 


F (0) — F() + Ja (Eta — i,t, — i) dO 


10, O03 ins. 
W2(0, 8) g 


Thus, the Fréchet derivative of F can be identified with the map —( ot, —i) in the 
tangent space at Qo, a subspace of (80). 


Proof. Introduce the slopes u(@,A) defined by (3.4). Then for all A,u(@,A) + 0 
as W2(0, A) — 0, by the differentiability properties established above. Let us show 
that Eu(@,A) — 0 as well. By Proposition 3.2.12, the expectation of u is bounded 
above by a constant that does not depend on A, and below by the negative of 


1Wa (00, A) +E4/2W? (6, 60) +2W2(A , do) 


< V2W (0o, 5p) EW: (800,4) + V2EW (A , ôo). 


Both expectations are finite by the hypothesis on A because the Fréchet functional 
is finite. The dominated convergence theorem yields 


F (0) — F(@) +Efy(tg, — i th, — i) dO 
W2(8, 6) 


> 0, W2(80,0) > 0. 


zu(0, A) = 


The measurability of the integral and the result then follow from Fubini’s theorem 
(see Proposition 2.4.9). 


Proposition 3.2.14 Let A be a random measure in W2( X ) with finite Fréchet func- 
tional F, and let y be absolutely continuous in W2(2). Then y is a Karcher mean 
of A if and only if ity —i=0 in 2% (y). Furthermore, if y is a Fréchet mean of A, 
then it is also a Karcher mean. 


The characterisation of Karcher means follows immediately from Theorem 3.2.13. 
The other statement is that the derivative vanishes at the minimum, which is fairly 
obvious intuitively; see page 73 in the supplement. 
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3.3 Bibliographical Notes 


Proposition 3.1.2 is essentially due to Agueh and Carlier [2, Proposition 4.2], who 
show it on R¢ (see also Zemel and Panaretos [134, Theorem 2]). An earlier result in 
a compact setting can be found in Carlier and Ekeland [33]. The formulation given 
here is from Masarotto et al. [91]. A more general version is provided by Le Gouic 
and Loubes [87, Theorem 8]. 

Lemmata 3.1.11 and 3.1.12 are from [135], but were known earlier (e.g., Bonneel 
et al. [30]). 

Proposition 3.1.6 is a simplified version of Alvarez-Esteban et al. [8, Theo- 
rem 2.8] (see [8, Corollary 2.9]). 

Propositions 3.2.3 and 3.2.7 are from Bigot and Klein [22], who also show the 
law of large numbers (Corollary 3.2.10) and deal with the one-dimensional setup 
(Theorem 3.2.11) in a compact setting. Section 2.4 appears to be new, but see the 
discussion in its beginning for other measurability results. 

Barycentres can be defined for any p > 1 as the measures minimising +> 
wp (A, LL). (Strictly speaking, these are not Fréchet means unless p = 2.) Le Gouic 
and Loubes [87] show Proposition 3.2.3 and Theorem 3.2.9 in this more general 
setup, where R¢ can be replaced by any separable locally compact geodesic space. 
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Why is it relevant to construct the Fréchet mean of a collection of measures with 
respect to the Wasserstein metric? A simple answer is that this kind of average will 
often express a more natural notion of “typical” realisation of a random probabil- 
ity distribution than an arithmetic average.! Much more can be said, however, in 
that the Wasserstein—Fréchet mean and the closely related notion of an optimal mul- 
ticoupling arise canonically as the appropriate framework for the formulation and 
solution to the problem of separation of amplitude and phase variation of a point 
process. It would almost seem that Wasserstein—Fréchet means were “made” for 
precisely this problem. 

When analysing the (co)variation of a real-valued stochastic process {Y (x) : x € 
K} over a convex compact domain K, it can be broadly said that one may distinguish 
two layers of variation: 


e Amplitude variation. This is the “classical” variation that one would also en- 
counter in multivariate analysis, and refers to the stochastic fluctuations around a 
mean level, usually encoded in the covariance kernel, at least up to second order. 


In short, this is variation “in the y-axis” (ordinate). 


Electronic Supplementary Material The online version of this chapter (https://doi.org/10.1007/ 
978-3-030-38438-8_4) contains supplementary material. 


' For instance, the arithmetic average of two scalar Gaussians N({11,1) and N({2, 1) will be their 
mixture with equal weights, but their Fréchet—Wasserstein average will be the Gaussian N( 5 Hi + 
5 Lo, 1) (see Lemma 4.2.1), which is arguably more representative from an intuitive point of view. 
In much the same way, the Fréchet—Wasserstein average of probability measures representing some 
type of object (e.g., normalised greyscale images of faces) will also be an object of the same type. 
This sort of phenomenon is well-known in manifold statistics, more generally, and is arguably one 
of the key motivations to account for the non-linear geometry of the sample space, rather than 
imbed it into a larger linear space and use the addition operation. 
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e Phase variation. This is a second layer of non-linear variation peculiar to contin- 
uous domain stochastic processes, and is rarely—if ever—encountered in multi- 
variate analysis. It arises as the result of random changes (or deformations) in the 
time scale (or the spatial domain) of definition of the process. It can be concep- 
tualised as a composition of the stochastic process with a random transformation 
(warp map) acting on its domain. 


This is variation “in the x-axis” (abscissa). 


The terminology on amplitude/phase variation is adapted from random trigonomet- 
ric functions, which may vary in amplitude (oscillations in the range of the function) 
or phase (oscillations in the domain of the function). Failing to properly account for 
the superposition of these two forms of variation may entirely distort the findings 
of a statistical analysis of the random function (see Sect. 4.1.1). Consequently, it is 
an important problem to be able to separate the two, thus correctly accounting for 
the distinct contribution of each. The problem of separation is also known as that of 
registration (Ramsay and Li [108]), synchronisation (Wang and Gasser [129]), or 
multireference alignment (Bandeira et al. [16]), though in some cases these terms 
refer to a simpler problem where there is no amplitude variation at all. 

Phase variation naturally arises in the study of random phenomena where there 
is no absolute notion of time or space, but every realisation of the phenomenon 
evolves according to a time scale that is intrinsic to the phenomenon itself, and (un- 
fortunately) unobservable. Processes related to physiological measurements, such 
as growth curves and neuronal signals, are usual suspects. Growth curves can be 
modelled as continuous random functions (functional data), whereas neuronal sig- 
nals are better modelled as discrete random measures (point processes). We first 
describe amplitude/phase variation in the former’ case, as that is easier to appreci- 
ate, before moving on to the latter case, which is the main subject of this chapter. 


4.1 Amplitude and Phase Variation 


4.1.1 The Functional Case 


Let K denote the unit cube [0, 1] C R®. A real random function Y = (Y (x) : x € K) 
can, broadly speaking, have two types of variation. The first, amplitude varia- 
tion, results from Y(x) being a random variable for every x and describes its fluc- 
tuations around the mean level m(x) = EY(x), usually encoded by the variance 
varY (x). For this reason, it can be referred to as “variation in the y-axis”. More 


2 As the functional case will only serve as a motivation, our treatment of this case will mostly 
be heuristic and superficial. Rigorous proofs and more precise details can be found in the books 
by Ferraty and Vieu [51], Horvath and Kokoszka [70], or Hsing and Eubank [71]. The notion 
of amplitude and phase variation is discussed in the books by Ramsay and Silverman [109, 110] 
that are of a more applied flavour. One can also consult the review by Wang et al. [127], where 
amplitude and phase variation are discussed in Sect. 5.2. 
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generally, for any finite set x1,...,x,, the n x n covariance matrix with entries 
K(xi,x;) = cov[Y (xj), ¥(x;)] encapsulates (up to second order) the stochastic de- 
viations of the random vector (Y(x1),...,¥(%)) from its mean, in analogy with the 
multivariate case. Heuristically, one then views amplitude variation as the collection 
k(x, y) for x,y € K in a sense we discuss next. 

One typically views Y as a random element in the separable Hilbert space L,(K), 
assumed to have E||Y ||? < ~ and continuous sample paths, so that in particular Y (x) 
is a random variable for all x € K. Then the mean function 


m(x) =EY(x), xEK 


and the covariance kernel 
(x,y) =cov[¥(x),Y(y)], x,y € K 


are well-defined and finite; we shall assume that they are continuous, which is equiv- 
alent to Y being mean-square continuous: 


Yo-Yo, y>x. 


The covariance kernel « gives rise to the covariance operator 2 : L(K) — L2(K), 
defined by 


(BA) = | xe.) fe) ae, 


a self-adjoint positive semidefinite Hilbert-Schmidt operator on L2 (K). The justi- 
fication to this terminology is the observation that when m = 0, for all bounded 


Jag = L(K), 


YA Ye) =EL f YOYE] = S2020); 
and so, without the restriction to m = 0, 


cov iY. f) 8) = f @ (BNO) dy = (6, BN). 
The covariance operator admits an eigendecomposition (rk, 6, )7_, such that rg N 0, 
Zok = reo, and (Qp) is an orthonormal basis of L2 (K). One then has the celebrated 
Karhunen—Loéve expansion 


¥(x) = m(x) + S Y -m dy) de(x) = m(x) + S Entel). 
k=1 


A major feature in this expansion is the separation of the functional part from the 
stochastic part: the functions @,(x) are deterministic; the random variables &, are 
scalars. This separation actually holds for any orthonormal basis; the role of choos- 
ing the eigenbasis of Z is making €; uncorrelated: 


78 4 Phase Variation and Fréchet Means 


cov(Ex, 61) = cov [(¥, bx) , (Y, 1) ] = (1, Z Ok) 


vanishes when k # / and equals r otherwise. For this reason, it is not surprising 
that using as ġ the eigenfunctions yields the optimal representation of Y. Here, 
optimality is with respect to truncations: for any other basis (y;,) and any M, 


2 
> 


2 


M M 
Mek = m, Yk) Y m— 2 — mbt) be 


so that (@,) provides the best finite-dimensional approximation to Y. The approxi- 
mation error on the right-hand side equals 


2 
Y Sebi 


k=M+1 


and depends on how quickly the eigenvalues of # decay. 
One carries out inference for m and «K on the basis of a sample Yj,...,Y, by 


1 n 

)=7 2na), xEK 
and 

=- E ONYO) -AM), 


from which one proceeds to estimate # and its eigendecomposition. 

We have seen that amplitude variation in the sense described above is linear and 
dealt with using linear operations. There is another, qualitatively different type of 
variation, phase variation, that is non-linear and does not have an obvious finite- 
dimensional analogue. It arises when in addition to the randomness in the values 
Y (x) itself, an extra layer of stochasticity is present in its domain of definition. In 
mathematical terms, there is a random invertible warp function (sometimes called 
deformation or warping) T : K + K and instead of Y (x), one observes realisations 
from 


¥(x)=Y(T1(x)), x€K. 


For this reason, phase variation can be viewed as “variation in the x-axis”. When d = 
1, the set K is usually interpreted as a time interval, and then the model stipulates that 
each individual has its own time scale. Typically, the warp function is assumed to be 
a homeomorphism of K independent of Y and often some additional smoothness is 
imposed, say T € CĈ. One of the classical examples is growth curves of children, of 
which a dataset from the Berkeley growth study (Jones and Bayley [73]) is shown 
in Fig.4.1. The curves are the derivatives of the height of a sample of ten girls 
as a function of time, from birth until age 18. One clearly notices the presence 
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of the two types of variation in the figure. The initial velocity for all children is 
the highest immediately or shortly after birth, and in most cases decreases sharply 
during the first 2 years. Then follows a period of acceleration for another year or 
so, and so on. Despite presenting qualitatively similar behaviour, the curves differ 
substantially not only in the magnitude of the peaks but also in their location. For 
instance, one red curve has a local minimum at the age of three, while a green one 
has a local maximum at almost that same time point. It is apparent that if one tries 
to estimate the mean function by averaging the curves at each time x, the shape 
of the resulting estimate would look very different from each of the curves. Thus, 
this pointwise averaging (known as the cross-sectional mean) fails to represent the 
typical behaviour. This phenomenon is seen more explicitly in the next example. 
The terminology of amplitude and phase comes from trigonometric functions, from 


growth rate (cm / year) 


Fig. 4.1: Derivatives of growth curves of ten girls from the Berkeley dataset. The 
data and the code for the figure are from the R package fda (Ramsay et al. [111]) 


which we derive an artificial example that illustrates the difficulties of estimation 
in the presence of phase variation. Let A and B be symmetric random variables and 
consider the random function 


Ÿ (x) =Asin[8a(x+B)]. (4.1) 


(Strictly speaking, x ++ x+ B is not from (0, 1] to itself; for illustration purposes, we 
assume in this example that K = R.) The random variable A generates the amplitude 
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variation, while B represents the phase variation. In Fig. 4.2, we plot four realisations 
and the resulting empirical means for the two extreme scenarios where B = 0 (no 
phase variation) or A = 1 (no amplitude variation). In the left panel of the figure, 
we see that the sample mean (in thick blue) lies between the observations and has a 
similar form, so can be viewed as the curve representing the typical realisation of the 
random curve. This is in contrast to the right panel, where the mean is qualitatively 
different from all curves in the sample: though periodicity is still present, the peaks 
and troughs have been flattened, and the sample mean is much more diffuse than 
any of the observations. 


Fig. 4.2: Four realisations of (4.1) with means in thick blue. Left: amplitude varia- 
tion (B = 0); right: phase variation (A = 1) 


The phenomenon illustrated in Fig. 4.2 is hardly surprising, since as mentioned 
earlier amplitude variation is linear while phase variation is not, and taking sample 
means is a linear operation. Let us see in formulae how this phenomenon occurs. 
When A = | we have 


bY (x) = sin(87x)E[cos(87B)] + cos(82x)E[sin(87B)]. 


Since B is symmetric the second term vanishes, and unless B is trivial the expectation 
of the cosine is smaller than one in absolute value. Consequently, the expectation of 
Y(x) is the original function sin 87x multiplied by a constant of magnitude strictly 
less than one, resulting in peaks of smaller magnitude. 

In the general case, where Y(x) = Y(T~!(x)) and Y and T are independent, we 
have 


and 
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From this, several conclusions can be drawn. Let ñ = u (T~! (x)) be the conditional 
mean function given T. Then the value of the mean function itself, Ef, at xq is 
determined not by a single point, say x, but rather by all the values of m at the 
possible outcomes of 7~! (x). In particular, if xo was a local maximum for m, then 
S| (xo)] will typically be strictly smaller than m(xq); the phase variation results in 
smearing m. 

At this point an important remark should be made. Whether or not phase variation 
is problematic depends on the specific application. If one is interested indeed in the 
mean and covariance functions of Y, then the standard empirical estimators will be 
consistent, since Y itself is a random function. But if it is rather m, the mean of Y, 
that is of interest, then the confounding of the amplitude and phase variation will 
lead to inconsistency. This can also be seen from the formula 


P(x) = m(T-"(x)) + > Exe(T~!(x)). 
=i 


The above series is not the Karhunen-Loève expansion of Y; the simplest way to 
notice this is the observation that ¢(T~'(x)) includes both the functional compo- 
nent ¢; and the random component T~!(x). The true Karhunen—Loéve expansion 
of Ÿ will in general be qualitatively very different from that of Y, not only in terms 
of the mean function but also in terms of the covariance operator and, consequently, 
its eigenfunctions and eigenvalues. As illustrated in the trigonometric example, the 
typical situation is that the mean EY is more diffuse than m, and the decay of the 
eigenvalues 7; of the covariance operator is slower than that of rg; as a result, one 
needs to truncate the sum at high threshold in order to capture a substantial enough 
part of the variability. In the toy example (4.1), the Karhunen—Loéve expansion has 
a single term besides the mean if B = 0, while having two terms if A = 1. 

When one is indeed interested in the mean m and the covariance k, the random 
function T pertaining to the phase variation is a nuisance parameter. Given a sample 
Y; =Y;0 T i=1,...,n, there is no point in taking pointwise means of Y;, because 
the curves are misaligned; i (x) = Yı (T7 ' (x)) should not be compared with (x), 
but rather with ¥2(7,_'(x)) = (T7! (Ta (x)). To overcome this difficulty, one seeks 


estimators 7; such that 


2 


¥ilz) =A) =A EE) 


is approximately Y;(x). In other words, one tries to align the curves in the sample 
to have a common time scale. Such a procedure is called curve registration. Once 
registration has been carried out, one proceeds the analysis on Y;(x) assuming only 
amplitude variation is now present: estimate the mean m by 
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and the covariance K by its analogous counterpart. Put differently, registering the 
curves amounts to separating the two types of variation. This step is crucial regard- 
less of whether the warp functions are considered as nuisance or an analysis of the 
warp functions is of interest in the particular application. 

There is an obvious identifiability problem in the model Y = Yo T~!. If S is any 
(deterministic) invertible function, then the model with (Y,T) is statistically indis- 
tinguishable from the model with (Y o S,T oS). It is therefore often assumed that 
T =i is the identity and in addition, in nearly all application, that T is monotoni- 
cally increasing (if d = 1). 

Discretely observed data. One cannot measure the height of person at every sin- 
gle instant of her life. In other words, it is rare in practice that one has access to the 
entire curve. A far more common situation is that one observes the curves discretely, 
i.e., at a finite number of points. The conceptually simplest setting is that one has 
access to a grid x;,...,x; E€ K, and the data come in the form 


Jij = ¥i(t)), 
possibly with measurement error. The problem is to find, given ¥;;, consistent esti- 
mators of 7; and of the original, aligned functions Y;. 

In the bibliographic notes, we review some methods for carrying out this sepa- 
ration of amplitude and phase variation. It is fair to say that no single registration 
method arises as the canonical solution to the functional registration problem. In- 
deed, most need to make additional structural and/or smoothness assumptions on 
the warp maps, further to the basic identifiability conditions requiring that T be in- 
creasing and that ET equal the identity. We will eventually see that, in contrast, the 
case of point processes (viewed as discretely observed random measures) admits a 
canonical framework, without needing additional assumptions. 


4.1.2 The Point Process Case 


A point process is the mathematical object that represents the intuitive notion of a 
random collection of points in a space 2. It is formally defined as a measurable 
map IT from a generic probability space into the space of (possibly infinite) Borel 
integer-valued measures of Z in such a way that IT(B) is a measurable real-valued 
random variable for all Borel subsets B of 2. The quantity TI (B) represents the 
random number of points observed in the set B. Among the plethora of books on 
point processes, let us mention Daley and Vere-Jones [41] and Karr [79]. Kallenberg 
[75] treats more general objects, random measures, of which point processes are a 
peculiar special case. We will assume for convenience that IT is a measure on a 
compact subset K C R. 
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Amplitude variation of IT can be understood in analogy with the functional case. 
One defines the mean measure 


A(A) = E[TT(A)], A C K Borel 


and, provided that E[IT(K)]|? < æ, the covariance measure 


(A,B) = cov[II(A), TI(B)] = E[I(A)I(B)] — 2(A)A(B), 


the latter being a finite signed Borel measure on K. Just like in the functional case, 
these two objects encapsulate the (second-order) amplitude variation? properties of 
the law of IT. 

Given a sample IT),..., IT, of independent point processes distributed as IT, the 
natural estimators 


n ee 1 n 


A(A) =- my (A,B) = - 2 T(A)(B) —1(A)A(B), 


are consistent and the former asymptotically normal [79, Proposition 4.8]. 

Phase variation then pertains to a random warp function T : K — K (independent 
of IT) that deforms IT: if we denote the points of IT by x1,...,xx (with K random), 
then instead of (x;), one observes T(x1),...,T (xx). In symbols, this means that the 
data arise as IT = T#IT. We refer to TT as the original point processes, and Ti as the 
warped point processes. An example of 30 warped and unwarped point processes 
is shown in Fig. 4.3. The point patterns in both panels present a qualitatively simi- 
lar structure: there are two peaks of high concentration of points, while few points 
appear between these peaks. The difference between the two panels is in the posi- 
tion and concentration of those peaks. In the left panel, only amplitude variation is 
present, and the location/concentration of the peaks is the same across all observa- 
tions. In contrast, phase variation results in shifting the peaks to different places for 
each of the observations, while also smearing or sharpening them. Clearly, estima- 
tion of the mean measure of a subset A by averaging the number of observed points 
in A would not be satisfactory as an estimator of A when carried out with the warped 
data. As in the functional case, it will only be consistent for the measure A defined 
by 


A(A)=E[A(T'(A))], ACZ, 


and A = {[T#A] misses most (or at least a significant part) of the bimodal structure 
of À and is far more diffuse. 


3 If the cumulative count process T(t) = I[0,t) is mean-square continuous, then the use of the 
term “amplitude variation” can be seen to remain natural, as T(t) will admit a Karhunen—Loéve 
expansion, with all stochasticity being attributable to the random amplitudes in the expansion. 
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Since IT and T are independent, the conditional expectation of Tl given T is 


IF (A)IT] = E[M(7~!(A))|T] = A(T! (A)) = [T#A](A). 


Consequently, we refer to A = T#A as the conditional mean measure. The problem 
of separation of amplitude and phase variation can now be stated as follows. On the 
basis of a sample IT,,..., Hn, find estimators of (7;) and (I7;). Registering the point 


processes amounts to constructing estimators, registration maps T , such that the 
aligned points 


ees, 


=T Hih = i oT] #0; 


are close to the original points Ij. 


g 4 o aano o anoo 2J o amea co =o 
© ammo o o o ommo o Ar = 
ooœwamooo o o o ooœapom0 o o a amoo o 
© amam o © © © occpoameno o @00 omo o o owo 
lo o ommo o oo O — moo lo © omm o 00 o Coc CO 
A o aoaw ‘cama NT comp comm 
amum 00 oo œ o ammoo 
CCE o oo o mawo oomoo co D am como 
con ao © amm œ oova mo omo 
o omamo o o o@amo oo o z 
AT o o oo omw o A] am o 00 o wo ao o 
awo o o amoo o 
œo © o amom œ momoo oo o o omom © 
omamo o oo o mowo o wamo o oo omav o 
lo © amo o œ o  oammoo lo 
= 4 © ammooo o a cæ > œwammwoo 
amman O o 000 ooma o capcom o o amm o 
o apam o o oammaw w maoa 
o amooo o o o0 amwa o amaw oomoo o oo am 
o O o oma o o o amomo o o o o ommawoo o o amw 
= o =E ocoomanmoo o o 
© onmamoo o oo o cooooaan o ommowaoo o o o omp 
ammo ao o 
o omm o ow æ 
© c@mmao oo o œ ommno © omma wo o o om 
wo 7 cmammso0 0 0 © o omoammoo wo 4 
commo o œ ama oo maamoow o o 
amamoomw o @ wamom mo 00 moomo 
amamma o o o ammo o o o amo 
omw o o coammam o © amooo o o aw 
Occi T T T T T T T T eT T T T T T T T T 
-16 -12 -8 -4 0 4 8 12 16 -16 -12 -8 -4 0 4 8 12 16 


Fig. 4.3: Unwarped (left) and warped Poisson point processes 


Remark 4.1.1 (Poisson Processes) A special but important case is that of a Pois- 
son process. Gaussian processes probably yield the most elegant and rich theory 
in functional data analysis, and so do Poisson processes when it comes to point 
processes. We say that II is a Poisson process when the following two conditions 
hold. (1) For any disjoint collection (A,,...,An) of Borel sets, the random variables 
TI(A1),..., (An) are independent; and (2) for every Borel A C K, TI(A) follows a 
Poisson distribution with mean À (A): 


aay [A (A) 
P(II(A) =k) =e a AT 


Conditional upon T, the random variables TI(Ax) =H" (A,)), k= 1,...,n are 
independent as the sets (T7! (A;)) are disjoint, and TI (A) follows a Poisson distribu- 
tion with mean à (TT! (A)) = A (A). This is precisely the definition of a Cox process: 


conditional upon the driving measure A, TI is a Poisson process with mean measure 
À. For this reason, it is also called a doubly stochastic process; in our context, the 
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phase variation is associated with the stochasticity of A while the amplitude one is 
associated with the Poisson variation conditional upon A. 


As in the functional case there are problems with identifiability: the model (IT, T) 

cannot be distinguished from the model (S#IT,T oS~!) for any invertible S : K > K. 
It is thus natural to assume that ET is the identity map* (otherwise set $ = ET, i.e., 
replace IT by [ET]#II and T by T o[ET]~!). 
_ Constraining T to have mean identity is nevertheless not sufficient for the model 
II = T#IT to be identifiable. The reason is that given the two point sets IT and IT, 
there are many functions that push forward the latter to the former. This ambiguity 
can be dealt with by assuming some sort of regularity or parsimony for T. For ex- 
ample, when K = [a,b] is a subset of the real line, imposing T to be monotonically 
increasing guarantees its uniqueness. In multiple dimensions, there is no obvious 
analogue for increasing functions. One possible definition is the monotonicity de- 
scribed in Sect. 1.7.2: 


(T(y) —T(x),y—x) = 0, x,y EK. 


This property is rather weak in a sense we describe now. Let K C R? and write y > x 
if and only if y; > x; for i = 1,2. It is natural to expect the deformations to maintain 
the lexicographic order in R?: 


yx => Ty) >T(a). 


If we require in addition that the ordering must be preserved for all quadrants: for 
z=T(x) andw=T(y) 


{y1 > X1,y2 < x2} = {wi > 21,W2 <2}, 


then monotonicity is automatically satisfied. In that sense, it is arguably not very 
restrictive. 

Monotonicity is weaker than cyclical monotonicity (see (1.10) with y; = T(x;)), 
which is itself equivalent to the property of being the subgradient of a convex 
function. But if extra smoothness is present and T is a gradient of some function 
$ : K —R, then ọ must be convex and T is then cyclically monotone. Consequently, 
we will make the following assumptions: 


e the expected value of T is the identity; 
e T isa gradient of a convex function. 


In the functional case, at least on the real line, these two conditions are imposed 
on the warp functions in virtually all applications, often accompanied with addi- 
tional assumptions about smoothness of T, its structural properties, or its distance 
from the identity. In the next section, we show how these two conditions alone lead 
to the Wasserstein geometry and open the door to consistent, fully nonparametric 
separation of the amplitude and phase variation. 


4 This can be defined as Bochner integral in the space of measurable bounded T : K > K. 
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4.2 Wasserstein Geometry and Phase Variation 


4.2.1 Equivariance Properties of the Wasserstein Distance 


A first hint to the relevance of Wasserstein metrics in Y,(.2°) for deformations of 
the space 2 is that for all p > 1 and all x,y E X, 


W, (6x, Ô) = llx—yIl, 


where 6, is as usual the Dirac measure at x € 2. This is in contrast to metrics such 
as the bounded Lipschitz distance (that metrises weak convergence) or the total 
variation distance on P(X). Recall that these are defined by 


[eau- f pav; 


u —v||sL = sup 
l@llBL<1 


lu =v = snp PA) vA), 


so that 


1 xÆy 


||: — ð l|sL = min(1, 
0 x=y. 


); 15 3w =| 


In words, the total variation metric “does not see the geometry” of the space Z. 
This is less so for the bounded Lipschitz distance that does take small distances into 
account but not large ones. 

Another property (shared by BL and TV) is equivariance with respect to transla- 
tions. It is more convenient to state it using the probabilistic formalism of Sect. 1.2. 
Let X ~ u and Y ~ v be random elements in 2, a be a fixed point in 2°, X’=X +a 
and Y’ = Y +a. Joint couplings Z’ = (X',Y’) are precisely those that take the form 
(a,a) +Z for a joint coupling Z = (X,Y). Thus 


W,, (U * Ôa, V * Og) = W,(X' +4,Y' +a) =W,(X,Y) =W, (U,V), 


where 6, is a Dirac measure at a and * denotes convolution. 
This carries over to Fréchet means in an obvious way. 


Lemma 4.2.1 (Fréchet Means and Translations) Let A be a random measure in 
W>( 2) with finite Fréchet functional and a € X. Then y is a Fréchet mean of A if 
and only if y* Ôa is a Fréchet mean of A * ôa. 


The result holds for other values of p, in the formulation sketched in the bibli- 
ographic notes of Chap.2. In the quadratic case, one has a simple extension to 
the case where only one measure is translated. Denote the first moment (mean) of 


LEW (2X) b 


m: W(X) > K m(u) =f xino). 
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(When % is infinite-dimensional, this can be defined as the unique element m € 2 
satisfying 


(my) = fey) du), yeZ) 
By an equivalence of couplings similar to above, we obtain 
Wy (H * Ôa, V) = W3 (u, v) + (a — [m(u) — m(v)])” — [m(w) —m(v))’, 
which is minimised at a = m() — m(v). This leads to the following conclusion: 


Proposition 4.2.2 (First Moment of Fréchet Mean) Let A be a random measure 
in W(X) with finite Fréchet functional and Fréchet mean y. Then 


ce) =E Jaw] ; 


4.2.2 Canonicity of Wasserstein Distance in Measuring 
Phase Variation 


The purpose of this subsection is to show that the standard functional data analysis 
assumptions on the warp function T, having mean identity and being increasing, are 
equivalent to purely geometric conditions on T and the conditional mean measure 
A = T#A. Put differently, if one is willing to assume that ET = i and that T is 
increasing, then one is led unequivocally to the problem of estimation of Fréchet 
means in the Wasserstein space #4(.2). When 2 # R, “increasing” is interpreted 
as being the gradient of a convex function, as explained at the end of Sect. 4.1.2. 

The total mass A (.2°) is invariant under the push-forward operation, and when it 
is finite, we may assume without loss of generality that it is equal to one, because all 
the relevant quantities scale with the total mass. Indeed, if A = ty with u probability 
measure and T > 0, then T#A = T x T#u, and the Wasserstein distance (defined as 
the infimum-over-coupling integrated cost) between tu and Tv is TW,(u,V) for 
U,V probabilities. 

We begin with the one-dimensional case, where the explicit formulae allow for a 
more transparent argument, and for simplicity we will assume some regularity. 


Assumptions 2 The domain K C R is a nonempty compact convex set (an interval), 
and the continuous and injective random map T : K — R (a random element in 
C,(K)) satisfies the following two conditions: 


(Al) Unbiasedness: E[T (x)| = x for all x € K. 
(A2) Regularity: T is monotone increasing. 


The relevance of the Wasserstein geometry to phase variation becomes clear in the 
following proposition that shows that Assumptions 2 are equivalent to geometric 
assumptions on the Wasserstein space W(R). 
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Proposition 4.2.3 (Mean Identity Warp Functions and Fréchet Means in 7% (R)) 
Let 6 C K C R compact and convex and T : K + R continuous. Then Assumptions 2 
hold if and only if, for any A € Ws(K) supported on K such that E[W}(T#A,A)| < œ, 
the following two conditions are satisfied: 


(B1) Unbiasedness: for any 0 € %(R) 


LW (THA ,A)] < E[W3 (THA, 0)]. 


(B2) Regularity: if Q : K > R is such that T#A = Q#A, then with probability one 


[|re@-sf aws f ow- aa, almost surely. 


These assumptions have a clear interpretation: (B1) stipulates that A is a Fréchet 
mean of the random measure A = T#4 , while (B2) states that T must be the optimal 
map from À to A, that is, T = i 


Proof. If T satisfies (B2) then, as an optimal map, it must be nondecreasing À- 
almost surely. Since A is arbitrary, T must be nondecreasing on the entire domain 
K. Conversely, if T is nondecreasing, then it is optimal for any A. Hence (A2) and 
(B2) are equivalent. 

Assuming (A2), we now show that (A1) and (B1) are equivalent. Condition (B1) 
is equivalent to the assertion that for all © € #(R), 


Epa — Fy "I3,0,1) =EIW3 (T#2.,A)] SE[W3(THA, 0)] =EllFpgh — Fe 'I2,00,1; 


Fra] !] = E[F, 1] = Fy l (see Sect. 3.1.4). Con- 
dition (A2) and the assumptions on T imply that F4 (x) = F,(T~!(x)). Suppose 
that F} is invertible (i.e., continuous and strictly increasing on K). Then Fy Mu) = 
T(F," (u)). Thus (B1) is equivalent to ET (x) = x for all x in the range of F,', which 
is K. The assertion that (A1) implies (B1), even if F} is not invertible, is proven in 
the next theorem (Theorem 4.2.4) in a more general context. 


which is in turn equivalent to E 


The situation in more than one dimension is similar but the proof is less transpar- 
ent. To avoid compactness assumptions, we introduce the following power growth 
condition (taken from Agueh and Carlier [2]) of continuous functions that grow like 
l-14 @ > 0: 


GAZ) = (+ | DZ) = fr: X —> R continuous : sup Se < =) 
xex 1+|lxII2 
with the norm ||flle, = sup|f(x)|/(1 + lll) = F/G + || - Ile. The space 
G,( X , X) is defined similarly, with f taking values in 2 instead of R, and the 
norm will be denoted in the same way. These are nonseparable Banach spaces. 


Theorem 4.2.4 (Mean Identity Warp Functions and Fréchet Means) Fix A € 
P(X) and let t € G(X, X) be a (Bochner measurable) random optimal map 
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with (Bochner) mean identity and such that E||t||g, < œ. Then A = t#A has Fréchet 
mean À: 


ZIW (A, A)| <E[W3(0,A)] VOEUX). 


The generalisation with respect to the one-dimensional result is threefold. Firstly, 
since our main interest is the implication (Al-A2) = (B1-B2), we need not assume 
T to be injective. Secondly, the support of A is not required to be compact. Lastly, the 
result holds in arbitrary dimension, including infinite-dimensional separable Hilbert 
spaces 2”. In particular, if t is a linear map, then ||t||G, coincides with the operator 
norm of t, so the assumption is that t be a bounded self-adjoint nonnegative operator 
with mean identity and finite expected operator norm. 


Proof. Optimality of t ensures that it has a convex potential @, and strong and weak 


duality give 
1 
J (abP-#0) ao; 
BA 


WOA > f (3-00) 400+ f (3i -60)) aa. 


waa) = f (Simi? 06) a 


Formally taking expectations, using Fubini’s theorem and that E@ = || - ||?/2 (since 
ùt is the identity) yields 


WEAN > f, (Sisi?-Bo)) wE] S (Sip? -9°0)) ao] -emaa 


as required. The rigorous mathematical justification for this is given on page 88 in 
the supplement. 


Remark 4.2.5 The “natural” space for t would be L(A), but without the conti- 
nuity assumption, the result may fail (Álvarez-Esteban et al. [9, Example 3.1]). A 
simple argument shows that the growth condition imposed by the G1 assumption is 
minimal; see page 89 in the supplement or Galasso et al. [58]. 


Remark 4.2.6 The same statement holds if 2 is replaced by a (Borel) convex sub- 
set K thereof. The integrals will then be taken on K, showing that A minimises the 
Fréchet functional among measures supported on K, and, by continuity, on K. By 
Proposition 3.2.4, À is a Fréchet mean. 


4.3 Estimation of Fréchet Means 


4.3.1 Oracle Case 


In view of the canonicity of the Wasserstein geometry in Sect. 4.2.2, separation of 
amplitude and phase variation of the point processes T; essentially requires comput- 
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ing Fréchet means in the 2-Wasserstein space. It is both conceptually important and 

technically convenient to introduce the case where an oracle reveals the conditional 

mean measures A = T#A entirely. Thus, assuming that 2 € #(.2) is the unique 

Fréchet mean of a random measure A, the goal is to estimate the structural mean A 

on the basis of independent and identically distributed realisations A;,...,A, of A. 
Given that À is defined as the minimiser of the Fréchet functional 


EWR (A, Y), YE W(X), 


it is natural to estimate A by a minimiser, say Àn, of the empirical Fréchet functional 
1 n 
R= ÈW AY), VERL) 
i=l 


A minimiser A, exists by Corollary 3.1.3. When 2 = R, A, can be seen to be an 
unbiased estimator of A in a generalised sense of Lehmann [88] (see Sect. 4.3.5). 

The warp maps (and their inverses) can then be estimated as the optimal maps 
from A, to each A; (see Sect. 4.3.4). 


4.3.2 Discretely Observed Measures 


In practice, one does not have the fortune of fully observing the inherently infinite- 
dimensional objects A,,...,A,. A far more realistic scenario is that one only has 
access to a discrete version of A;, say Aj. The simplest situation is when Aj arises 
as an empirical measure of the form t~! Xf; 6{Y;}, where Y; are independent with 
distribution A;. More generally, A; can be a normalised point process Ti; with mean 
measure TAj, i.e. 

~ 1 


A; = =I]; with E[,(A)|Aj]=tAj(A), AC & Borel. 
M(X) 


This encapsulates the case of empirical measure when T is an integer and Ti isa 
binomial point process. The parameter T is the expected number of observed points 
over the entire space 2’; the larger T is, the more information IJ; gives on Aj. 
Except if A; is an empirical measure, there is one difficulty in the above setting 
that needs to be addressed. Unless TI; is binomial, there is a positive probability that 
T7;(2 ) = 0 and no points pertaining to A; are observed. In the asymptotic setup 
below, conditions will be imposed to ensure that this probability becomes negligible 
as n — ce. For concreteness we define A; = 1.) for some fixed measure A) that will 
be of minor importance. This can be a Dirac measure at 0, a certain fixed Gaussian 
measure, or (normalised) Lebesgue measure on some bounded set in case 2 = Ri. 


We can now replace the estimator A, by An, defined as any minimiser of 
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1 n ~ 
i=l 


which exists by Corollary 3.1.3. 

As a generalisation of the discrete case discussed in Sect. 1.3, the Fréchet mean 
of discrete measures can be computed exactly. Suppose that N; = H;(. X ) is nonzero 
for all i. Then each Ai is a discrete measure supported on N; points. One can then re- 
cast the multimarginal formulation (see Sect. 3.1.2) as a finite linear program, solve 
it, and “average” the solution as in Proposition 3.1.2 in order to obtain A, (an al- 
ternative linear programming formulation for finding a Fréchet mean is given by 
Anderes et al. [14]). Thus, Àn can be computed in finite time, even when 2 is 
infinite-dimensional. 

Finally, a remark about measurability is in order. Point processes can be viewed 
as random elements in M} (X) endowed with the vague topology induced from 
convergence of integrals of continuous functions with compact support. If Un con- 
verge to u vaguely, and a, are numbers that converge to a, then ann — au vaguely. 
Thus, A; is a continuous function of the pair (TI;, TI;( 2 )) and can be viewed as a 
random measure with respect to the vague topology. The restriction of the vague 
topology to probability measures is equivalent to the weak topology, and therefore 
vague, weak, and Wasserstein measurability are all equivalent. 


4.3.3 Smoothing 


Even when the computational complexity involved in calculating A, is tractable, 
there is another reason not to use it as an estimator for À. If one has a-priori knowl- 
edge that A is smooth, it is often desirable to estimate it by a smooth measure. One 
way to achieve this would be to apply some smoothing technique to A, using, e.g., 
kernel density estimation. However, unless the number of observed points from each 
measure is the same N; = --- = N, = N, A, will usually be concentrated on many 
points, essentially Nj +----+N, of them. In other words, the Fréchet mean is concen- 
trated on many more points than each of the measures Aj, thus potentially hindering 
its usefulness as a mean because it will not be a representative of the sample. 

This is most easily seen when R= R, in which case each A; is a discrete uniform 
measure on points x} < 5 LURS xy, , where we assume for simplicity that the 
points are not repeated (this will Rapes with probability one if A; is diffuse). If ve 
now set G; to be the distribution function of A;, then the quantile function G;! 
piecewise constant on each interval (k,k + 1]/N; with jumps at 


kiN) =x,  k=1,2,...,N; 


5 In finite dimensional (or more generally, locally compact metric) spaces. If 2 is an infinite- 
dimensional Hilbert space, the vague topology is trivial. This is stated and proved as Lemma 5 on 
page 27 in the supplement. 
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The Fréchet mean has quantile function G™!(u) = n~! 5,G7' (u) and will have 
jumps at every point of the form k/N; for k < N; and i = 1,...,n. In the worst-case 
scenario, when no pair from N; has a common divisor, there will be 


($m-1) +1= ($x) —n+1 


jumps for G7}, which is the number of points on which the Fréchet mean will be 
supported. (All the G7! ’s have a jump at one which thus needs to be counted once 
rather than n times.) 

By counting the number of redundancies in the constraints matrix of the linear 
program, one can show that this is in general an upper bound on the number of 
support points of the Fréchet mean. 

An alternative approach is to first smooth each observation A, and then calculate 
the Fréchet mean. Since it is easy to bound the Wasserstein distances when deal- 
ing with convolutions, we will employ kernel density estimation, although other 
smoothing approaches could be used as well. 

To simplify the exposition, we provide the technical details only when .2 = R4, 
but a similar construction will work when the dimension of 2 is infinite. Let y : 
R? — (0,0) be a continuous, bounded, strictly positive isotropic density function 
with unit variance: y(x) = yı (||x||) with yı nonincreasing and 


faya f voa 


(Besides the boundedness all these properties can be relaxed, and if 2 = R even 
boundedness is not necessary.) A classical example for y is the standard Gaussian 
density in R@. Define the rescaled version Wo(x) = o~“y(x/o) for all o > 0. We 
can then replace A; by a smooth proxy A; * Wo. If A; is a sum of Dirac masses at 
X1,...,Xn,, then 


Ai* Wo has density g(x) = 


If N; = 0, one can either use A or A) x Wo; this event will have negligible prob- 
ability anyway. 7 

For the purpose of approximating Aj, this convolution is an acceptable estimator, 
because as was seen in the proof of Theorem 2.2.7, 


W3 (Aj, Ai* Wo) < 0°. 


But the measure A; has a strictly positive density throughout R“. If we know that A 
is supported on a convex compact K C Rf, it is desirable to construct an estimator 
that has the same support K. The first idea that comes to mind is to project Aj * Wo to 
K (see Proposition 3.2.4), as this will further decrease the Wasserstein distance, but 
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the resulting measure will then have positive mass on the boundary of K, and will 
not be absolutely continuous. We will therefore use a different strategy: eliminate 
all the mass outside K and redistribute it on K. The simplest way to do this is to 
restrict Aj * Wo to K and renormalise the restriction to be a probability measure. 
For technical reasons, it will be more convenient to bound the Wasserstein distance 
when the restriction and renormalisation is done separately on each point of A;. This 
yields the measure 

1 5 ô{xj} * Wo 

Ni Z [ô {xj} * Yo] (K) K 


Lemma 4.4.2 below shows that Ww; (Aj, Aj) < Co” for some finite constant C. It is 
apparent that A; is a continuous function of Aj and o, so A; is measurable; in any 
case this is not particularly important because o will vanish, so A; = A; asymptoti- 
cally and the latter is measurable. 


a 


Â= 


(4.2) 


Our final estimator In for A is defined as the minimiser of 
as 1 Pa 


Since the measures A; are absolutely continuous, Àn is unique. We refer to Àn as the 
regularised Fréchet—Wasserstein estimator, where the regularisation comes from the 
smoothing and the possible restriction to K. 


In the case 2 =R, An can be constructed via averaging of quantile functions. Let 
Gi be the distribution function of Ai. Then An is the measure with quantile function 


= (u) = LG. u € (0,1), 


and distribution function ne 
R (2) = [Fy '@). 


By construction, the Gi are continuous and strictly increasing, so the inverses are 
proper inverses and one does not to use the right-continuous inverse as in Sect. 3.1.4. 

If X = R? and d > 2, then there is no explicit expression for An, although it 
exists and is unique. In the next chapter, we present a steepest descent algorithm that 
approximately constructs An by taking advantage of the differentiability properties 
of the Fréchet functional F, in Sect. 3.1.6. 


4.3.4 Estimation of Warpings and Registration Maps 


Once estimators Aj,i=1,...,n and An are constructed, it is natural to estimate the 
map 7; = ty and its inverse T lig A; (when Aj are absolutely continuous; see the 
discussion after Assumptions 3 below) by the plug-in estimators 
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The latter, the registration maps, can then be used in order to register the points H; 
via 


no = T 4” = a ó n] am”). 


l 


Per 


It is thus reasonable to expect that if T! is a good estimator, then its composition 
with 7; should be close to the identity and H; should be close to I. 


4.3.5 Unbiased Estimation When & =R 


In the same way, Fréchet means extend the notion of mean to non-Hilbertian spaces, 
they also extend the definition of unbiased estimators. Let H be a separable Hilbert 
space (or a convex subset thereof) and suppose that O is a random element in H 
whose distribution Ug depends on a parameter 0 € H. Then @ is unbiased for 0 if 
for all 0 € H 


E90 = J xino) =90. 
H 


(We use the standard notation 2og(0) = f g(x) dug (x) in the sequel.) This is equiv- 
alent to 


zoll — 8l? < Eolly-8l?, Y0, y €H. 


In view of that, one can define unbiased estimators of A € Ws as measurable func- 
tions 6 = 6(Aj,...,An) for which 


2 W3 (à, 8) <E,Wr(7,6),  Yy,0 E€ 9. 


This definition was introduced by Lehmann [88]. 

Unbiased estimators allow us to avoid the problem of over-registering (the so- 
called pinching effect; Kneip and Ramsay [82, Section 2.4]; Marron et al. [90, p. 
476]). An extreme example of over-registration is if one “aligns” all the observed 
patterns into a single fixed point x9. The registration will then seem “successful” 
in the sense of having no residual phase variation, but the estimation is clearly bi- 
ased because the points are not registered to the correct reference measure. Thus, 
requiring the estimator to be unbiased is an alternative to penalising the registration 
maps. 

Due to the Hilbert space embedding of %4 (R), it is possible to characterise unbi- 
ased estimators in terms of a simple condition on their quantile functions. As a corol- 
lary, An, the Fréchet mean of {Aj,...,A,}, is unbiased. Our regularised Fréchet- 
Wasserstein estimator A, can then be interpreted as approximately unbiased, since 
it approximates the unobservable Àn. 
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Proposition 4.3.1 (Unbiased Estimators in %4(R)) Let A be a random measure 
in (R) with finite Fréchet functional and let à be the unique Fréchet mean of A 
(Theorem 3.2.11). An estimator 6 constructed as a function of a sample (A,...,An) 
is unbiased for À if and only if the left-continuous representatives (in L2 (0, 1)) sat- 
isfy E[F 5 '(x)) = Fy ' (x) for all x € (0,1). 


Proof. The proof is straightforward from the definition: 6 is unbiased if and only if 
for all A and all y, 


— —1))2 —1 —14)2 
E, |F! —F5 ‘lz, < EallF -Fs lls 


which is equivalent to Ey [F51] = F; '. In other words, these two functions must 
equal almost everywhere on (0,1), and their left-continuous representatives must 
equal everywhere (the fact that E; [F | has such a representative was established 
in Sect. 3.1.4). 

To show that 6 = A, is unbiased, we simply invoke Theorem 3.2.11 twice to see 
that 


n 


L Er! 
D 


UF; '] =E =E[Fy']=F,', 


which proves unbiasedness of 6. 


4.4 Consistency 


In functional data analysis, one often assumes that the number of curves n and the 
number of observed points per curve m both diverge to infinity. An analogous frame- 
work for point processes would similarly require the number of point processes n as 
well as the expected number of points T per processes to diverge. A technical com- 
plication arises, however, because the mean measures do not suffice to characterise 
the distribution of the processes. Indeed, if one is given a point processes IT with 
mean measure A (not necessarily a probability measure), and T is an integer, there 
is no unique way to define a process IT (*) with mean measure TA. One can define 
TIC) = TIT, so that every point in IT will be counted T times. Such a construction, 
however, can never yield a consistent estimator of A, even when T — œ. 

Another way to generate a point process with mean measure TA is to take a 
superposition of T independent copies of IT. In symbols, this means 


T) =h +- +Ih, 


with (I;) independent, each having the same distribution as IT. This superposition 
scheme gives the possibility to use the law of large numbers. If t is not an integer, 
then this construction is not well-defined but can be made so by assuming that the 
distribution of IT is infinitely divisible. The reader willing to assume that T is always 
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an integer can safely skip to Sect.4.4.1; all the main ideas are developed first for 
integer values of T and then extended to the general case. 
A point process IT is infinitely divisible if for every integer m there exists a 


collection of m independent and identically distributed a” m) such that 
m=" +...4770/™ in distribution. 


If IT is infinitely divisible and t = k/m is rational, then can define nt) using km 
independent copies of IT (1/m), 


7 = sm, 
i=1 


One then deals with irrational T via duality and continuity arguments, as follows. 
Define the Laplace functional of IT by 


Ln(f) =E[e"] = exp (- [rar €[0,1], f: >R, Borel. 


The Laplace functional characterises the distribution of the point process, generalis- 
ing the notion of Laplace transform of a random variable or vector (Karr [79, Theo- 
rem 1.12]). By definition, it translates convolutions into products. When IT = H (1) 
is infinitely divisible, the Laplace functional Lı of IT takes the form (Kallenberg 
[75, Chapter 6]; Karr [79, Theorem 1.43]) 


Li(f)= z fe A — exp -/ a lao for some p € M4 (M4 (2X). 


M(X 


The Laplace functional of IT” is L+ (f) =[Li(f)]* for any rational T, which simply 
amounts to multiplying the measure p by the scalar t. One can then do the same for 
an irrational T, and the resulting Laplace functional determines the distribution of 
IT) for all t > 0. 


4.4.1 Consistent Estimation of Fréchet Means 


We are now ready to define our asymptotic setup. The following assumptions will 
be made. Notice that the Wasserstein geometry does not appear explicitly in these 
assumptions, but is rather derived from them in view of Theorem 4.2.4. The com- 
pactness requirement can be relaxed under further moment conditions on À and the 
point process IT; we focus on the compact case for the simplicity and because in 
practice the point patterns will be observed on a bounded observation window. 
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Assumptions 3 Let K C R? be a compact convex nonempty set, à an absolutely 
continuous probability measure on K, and T, a sequence of positive numbers. Let II 
be a point processes on K with mean measure À. Finally, define U = intK. 


e For every n, let Hm”, ae TL} be independent point processes, each having the 
same distribution as a superposition of T, copies of IT. 

e Let T be a random injective function on K (viewed as a random element in 
C,(K,K) endowed with the supremum norm) such that T(x) € U for x € U (that 
is, T € Cy(U,U)) with nonsingular derivative VT (x) € R?! for almost allx € U, 
that is a gradient of a convex function. Let {T,,...,T,} be independent and iden- 
tically distributed as T. 

e For every x € U, assume that E[T (x)] =x. 


e Assume that the collections {T,}°_, and {0P icnn.. are independent. 
e Let A” = Tan” be the warped point processes, having conditional mean mea- 
sures Aj = THA = 77! AAO T}. 


e Define A; by the smoothing procedure (4.2), using bandwidth ob 
sibly random). 


€ [0, 1] (pos- 


The dependence of the estimators on n will sometimes be tacit. But A; does not 
depend on n. 


By virtue of Theorem 4.2.4, A is a Fréchet mean of the random measure A = T#A. 
Uniqueness of this Fréchet mean will follow from Proposition 3.2.7 if we show 
that A is absolutely continuous with positive probability. This is indeed the case, 
since T is injective and has a nonsingular Jacobian matrix; see Ambrosio et al. [12, 
Lemma 5.5.3]. The Jacobian assumption can be removed when 2 = R, because 
Fréchet means are always unique by Theorem 3.2.11. 

Notice that there is no assumption about the dependence between rows. Assump- 
tions 3 thus cover, in particular, two different scenarios: 


e Full independence: here the point processes are independent across rows, that is, 


a% and ge are also independent. 


; 1 
e Nested observations: here mn ) 


tional points, that is, mnt) 


distributed as (T)41 — n). 


includes the same points as mi”) and addi- 


is a superposition of mi” and another point process 


Needless to say, Assumptions 3 also encompass binomial processes when T, are 
integers, as well as Poisson processes or, more generally, Poisson cluster processes. 

We now state and prove the consistency result for the estimators of the condi- 
tional mean measures A; and the structural mean measure À. 


Theorem 4.4.1 (Consistency) If Assumptions 3 hold, On = n! X}; ot 


most surely and T, —> œ as n — ©, then: 


—> 0al- 


1. The estimators A; defined by (4.2), constructed with bandwidth © = o, are 


l 
Wasserstein-consistent for the conditional mean measures: for all i such that 


o 20 
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Wr (Ai, Ai) aE 0, as n — °°; 


2. The regularised Fréchet—Wasserstein estimator of the structural mean measure 
(as described in Sect. 4.3) is strongly Wasserstein-consistent, 


WO A) £5 0, as n > %. 


Convergence in 1. holds almost surely under the additional conditions that Yy_ | T? 


<» and E [TI(R?)] i < œ. If O, — 0 only in probability, then convergence in 2. still 
holds in probability. 


Theorem 4.4.1 still holds without smoothing (0, = 0). In that case, An = Ñn is pos- 
sibly not unique, and the theorem should be interpreted in a set-valued sense (as in 
Proposition 1.7.8): almost surely, any choice of minimisers An converges to A as 
n— 9, 

The preceding paragraph notwithstanding, we will usually assume that some 
smoothing is present, in which case A, is unique and absolutely continuous by 
Proposition 3.1.8. The uniform Lipschitz bounds for the objective function show 
that if we restrict the relevant measures to be absolutely continuous, then A, is a 
continuous function of (Ay scl An) and hence Ay : (Q, F, P) — W2(K) is measur- 
able; this is again a minor issue because many arguments in the proof hold for each 
@ € Q separately. Thus, even if An is not measurable, the proof shows that the con- 
vergence holds outer almost surely or in outer probability. 

The first step in proving consistency is to show that the Wasserstein distance be- 
tween the unsmoothed and the smoothed estimators of A; vanishes with the smooth- 
ing parameter. The exact rate of decay will be important to later establish the rate of 
convergence of A, to A, and is determined next. 


Lemma 4.4.2 (Smoothing Error) There exists a finite constant Cy g, depending 
only on w and on K, such that 


w2 (nã) ECyxo? POLL (4.3) 


Since the smoothing parameter will anyway vanish, this restriction to small values 
of o is not binding. The constant Cy, x is explicit. When 2° = R, a more refined 
construction allows to improve this constant in some situations, see Panaretos and 
Zemel [100, Lemma 1]. 


Proof. The idea is that (4.2) is a sum of measures with mass 1 /N; that can be all sent 
to the relevant point x;, and we refer to page 98 in the supplement for the precise 
details. 


Proof (Proof of Theorem 4.4.1). The proof, detailed on page 97 of the supplement, 
follows the following steps: firstly, one shows the convergence in probability of A; 
to A;. This is basically a corollary of Karr [79, Proposition 4.8] and the smoothing 
bound from Lemma 4.4.2. 
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To prove claim (2) one considers the functionals, defined on #(K): 


Las 
F(Y) = 5 EWP(A,7); 
1 n 
Fi(Y) = 5— 2, W (Ai y); 
an 
A ay ae ~ A (n) 
FilY) = 5- L W (Ai), Aj=— or AO ) if Ni =0; 
a er ? NO) 
A 1 x pa 7 
BY =5 LW (Ai, Y), Aj =A if Nl =) 


Since K is compact, they are all locally Lipschitz, so their differences can be con- 
trolled by the distances between Aj, Aj, and Âi. The first distance vanishes since 
the intensity T — œ, and the second by the smoothing bound. Another compactness 
argument yields that F, — F uniformly on #%(K), and so the minimisers converge. 

The almost sure convergence in (1) is proven as follows. Under the stronger con- 


ditions at the end of the theorem’s statement, for any fixed a = (a1,...,aq) € RI, 
m(n) 
TI: —oo 
e( i i ,a]) Ai(( co, a} o) = 
n 


by the law of large numbers. This extends to all rational a’s, then to all a by approx- 
imation. The smoothing error is again controlled by Lemma 4.4.2. 


4.4.2 Consistency of Warp Functions and Inverses 


We next discuss the consistency of the warp and registration function estimators. 
These are key elements in order to align the observed point Paneme T;. Recall that 


we have consistent estimators Aj for A; and an for A. Then T; = a is estimated by 


t^ and T; l is estimated by t în. We will make the following extra assumptions that 


lead to more transparent stament (otherwise one needs to replace K with the set 
of Lebesgue points of the supports of A and Aj). 


Assumptions 4 (Strictly Positive Measures) In addition to Assumptions 3 sup- 
pose that: 


1. à has a positive density on K (in particular, suppA = K); 
2. T is almost surely surjective on U = intK (thus a homeomorphism of U). 


As a consequence suppA = supp(T#A) = T (suppå ) = K almost surely. 
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Theorem 4.4.3 (Consistency of Optimal Maps) Let Assumptions 4 be satisfied in 
addition to the hypotheses of Theorem 4.4.1. Then for any i such that ob 2 0 and 
any compact set S C intK, 
supy -r o suple) — Ti(x)|| > 0. 

xES xES 
Almost sure convergence can be obtained under the same provisions made at the 
end of the statement of Theorem 4.4.1. 


A few technical remarks are in order. First and foremost, it is not clear that the 
two suprema are measurable. Even though T; and T=! are random elements in 
C;(U,R¢), their estimators are only defined in an Ly sense. The proof of Theo- 
rem 4.4.3 is done @-wise. That is, for any @ in the probability space such that The- 
orem 4.4.1 holds, the two suprema vanish as n — oo. In other words, the convergence 
holds in outer probability or outer almost surely. 

Secondly, assuming positive smoothing, the random measures A; are smooth with 


densities bounded below on K, so T! are defined on the whole of U (possibly as 


set-valued functions on a A;-null set). But the only known regularity result for A, is 
an upper bound on its density (Proposition 3.1.8), so it is unclear what is its support 
and consequently what is the domain of definition of 7}. 


Lastly, when the smoothing parameter o is zero, T; and T are not defined. 
Nevertheless, Theorem 4.4.3 still holds in the set-valued formulation of Proposi- 
tion 1.7.11, of which it is a rather simple corollary: 


Proof (Proof of Theorem 4.4.3). The proof amounts to setting the scene in order to 
apply Proposition 1.7.11 of stability of optimal maps. We define 


a Aa 


Un = Aj; Vn = Ani u =A; v=A, ie u=T €, 


and verify the conditions of the proposition. The weak convergence of Un to u and 
Vn to v is the conclusion of Theorem 4.4.1; the finiteness is apparent because K is 
compact and the uniqueness follows from the assumed absolute continuity of Aj. 
Since in addition i= is uniquely defined on U = intK which is an open convex set, 
the restrictions on Q in Proposition 1.7.11 are redundant. Uniform convergence of 
T; to T; is proven in the same way. 


Corollary 4.4.4 (Consistency of Point Pattern Registration) For any i such that 


o 40, 


m| = |830. 


The division by the number of observed points ensures that the resulting measures 
are probability measures; the relevant information is contained in the point patterns 
themselves, and is invariant under this normalisation. 
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Proof. The law of large numbers entails that ni”) /T > 1, so in particular ni”) is 


almost surely not zero when n is large. Since i” = (T! o THOI®, we have the 
upper bound 


n (n) ae (n) 
m% T È TT 
wi (| s ITTE) -a a 
N” Ny K N 


i 


Fix a compact Q C intK and split the integral to Q and its complement. Then 


pem (n) (n) 

I; IM’ (K\Q 
| ITRE) -x d+ < dg uN) T 
K\Q Nw) Tn Nn”) 


l l 


3 dA (K\ Q), 


by the law of large numbers, where dx is the diameter of K. By writing intK as a 
countable union of compact sets (and since A is absolutely continuous), this can be 
made arbitrarily small by choice of Q. 
We can easily bound the integral on Q by 
(n) 


ae T; -a Z = 
fiz (Ti(x)) —xl|? ad qy £ SYP |T; '(Ti(x))—x|? = sup IIT 0) -T O). 
Q N xEQ YET;(Q) 


But 7;(Q) is a compact subset of U = intK, because T; € C (U, U). The right-hand 
side therefore vanishes as n —> œ by Theorem 4.4.3, and this completes the proof. 


Possible extensions pertaining to the boundary of K are discussed on page 33 of the 
supplement. 


4.5 Illustrative Examples 


In this section, we illustrate the estimation framework put forth in this chapter 
by considering an example of a structural mean A with a bimodal density on the 
real line. The unwarped point patterns IT originate from Poisson processes with 
mean measure À and, consequently, the warped points IT are Cox processes (see 
Sect. 4.1.2). Another scenario involving triangular densities can be found in Panare- 
tos and Zemel [100]. 


4.5.1 Explicit Classes of Warp Maps 


As a first step, we introduce a class of random warp maps satisfying Assumptions 2, 
that is, increasing maps that have as mean the identity function. The construction is 
a mixture version of similar maps considered by Wang and Gasser [128, 129]. 
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For any integer k define €; : [0, 1] — [0,1] by 


sin( kx) 
|k|z 


Co (x) =x, Ck (x) =x ) k Z\ {0}. (4.4) 
Clearly €,(0) = 0, €,(1) = 1 and ¢; is smooth and strictly increasing for all k. Fig- 
ure 4.4a plots ¢, for k = —3,...,3. To make €; a random function, we let k be an 
integer-valued random variable. If the latter is symmetric, then we have 


a [éx] = x, x € [0,1]. 


By means of mixtures, we replace this discrete family by a continuous one: let J > 1 
be an integer and V = (Vj,...,Vy) be a random vector following the flat Dirichlet 
distribution (uniform on the set of nonnegative vectors with vı +---+vy = 1). Take 
independently k; following the same distribution as k and define 


J 
T(x) = > Vié, (x). (4.5) 
j=l 


Since V; is positive, T is increasing and as (V;) sums up to unity T has mean identity. 
Realisations of these warp functions are given in Fig. 4.4b and c for J = 2 and J = 
10, respectively. The parameters (kj) were chosen as symmetrised Poisson random 
variables: each kj has the law of XY with X Poisson with mean 3 and P(Y = 1) = 
P(Y = —1) = 1/2 for Y and X independent. When J = 10 is large, the function T 
deviates only mildly from the identity, since a law of large numbers begins to take 
effect. In contrast, J = 2 yields functions that are quite different from the identity. 
Thus, it can be said that the parameter J controls the variance of the random warp 
function T. 


Fig. 4.4: (a) The functions {€_3,..., €3}; (b) realisations of T defined by (4.5) with 
J = 2 and kj symmetrisations of Poisson random variables with mean 3; (c) realisa- 
tions of T defined by (4.5) with J = 10 and k; as in (b) 
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4.5.2 Bimodal Cox Processes 


Let the structural mean measure À be a mixture of a bimodal Gaussian distribution 
(restricted to K = [—16, 16]) and a beta background on the interval [— 12, 12], so that 
mass is added at the centre of K but not near the boundary. In symbols this is given 
as follows. Let @ be the standard Gaussian density and let Bap denote the density 
of a the beta distribution with parameters œ and p. Then À is chosen as the measure 
with density 


l-e 


so = Flot 8) +0648) + E Bsa (= 


| ‘ x € [—16, 16], 
(4.6) 


where € € [0,1] is the weight of the beta background. (We ignore the loss of a 
negligible amount of mass due to the restriction of the Gaussians to [— 16, 16].) Plots 
of the density and distribution functions are given in Fig. 4.5. 
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Fig. 4.5: Density and distribution functions corresponding to (4.6) with € = 0 and 
é€=0.15 
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The main criterion for the quality of our regularised Fréchet—Wasserstein estima- 
tor will be its success in discerning the two modes at +8; these will be smeared by 
the phase variation arising from the warp functions. 

We next simulated 30 independent Poisson processes with mean measure À, 
g€ = 0.1, and total intensity (expected number of points) T = 93. In addition, we 
generated warp functions as in (4.5) but rescaled to [—16, 16]; that is, having the 
same law as the functions 


x+16 
f 2T 1 
x3 ( 32 ) 6 


from K to K. These cause rather violent phase variation, as can be seen by the plots 
of the densities and distribution functions of the conditional measures A = THA 
presented in Fig. 4.6a and b; the warped points themselves are displayed in Fig. 4.6c. 

Using these warped point patterns, we construct the regularised Fréchet- 
Wasserstein estimator employing the procedure described in Sect. 4.3. Each T; was 
smoothed with a Gaussian kernel and bandwidth chosen by unbiased cross valida- 
tion. We deviate slightly from the recipe presented in Sect. 4.3 by not restricting the 
resulting estimates to the interval [—16, 16], but this has no essential effect on the 
finite sample performance. The regularised Fréchet—Wasserstein estimator An serves 
as the estimator of the structural mean À and is shown in Fig. 4.7a. It is contrasted 
with A at the level of distribution functions, as well as with the empirical arithmetic 
mean; the latter, the naive estimator, is calculated by ignoring the warping and sim- 
ply averaging linearly the (smoothed) empirical distribution functions across the 
observations. We notice that An is rather successful at locating the two modes of À, 
in contrast with the naive estimator that is more diffuse. In fact, its distribution func- 
tion increases approximately linearly, suggesting a nearly constant density instead 
of the correct bimodal one. 
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Fig. 4.6: (a) 30 warped bimodal densities, with density of A given by (4.6) in solid 
black; (b) their corresponding distribution functions, with that of A in solid black; 
(c) 30 Cox processes, constructed as warped versions of Poisson processes with 
mean intensity 93 f using as warp functions the rescaling to [— 16.16] of (4.5) 
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(a) (b) (c) 


Fig. 4.7: (a) Comparison between the regularised Fréchet—Wasserstein estimator, 
the empirical arithmetic mean, and the true distribution function, including residual 
curves centred at y = 3/4; (b) The estimated warp functions; (c) Kernel estimates 
of the density function f of the structural mean, based on the warped and registered 
point patterns 


Estimators of the warp maps fi, depicted in Fig. 4.7b, and their inverses, are de- 


fined as the optimal maps between A, and the estimated conditional mean measures, 
as explained in Sect. 4.3.4. Then we register the point patterns by applying to them 


the inverse estimators T (Fig. 4.8). Figure 4.7c gives two kernel estimators of the 
density of A constructed from a superposition of all the warped points and all the 
registered ones. Notice that the estimator that uses the registered points is much 
more successful than the one using the warped ones in discerning the two den- 
sity peaks. This is not surprising after a brief look at Fig. 4.8, where the unwarped, 
warped, and registered points are displayed. Indeed, there is very high concentra- 
tion of registered points around the true location of the peaks, +8. This is not the 
case for the warped points because of the phase variation that translates the centres 
of concentration for each individual observation. It is important to remark that the 
fluctuations in the density estimator in Fig.4.7c are not related to the registration 
procedure, and could be reduced by a better choice of bandwidth. Indeed, our pro- 
cedure does not attempt to estimate the density, but rather the distribution function. 
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Fig. 4.8: Bimodal Cox processes: (a) the observed warped point processes; (b) the 
unobserved original point processes; (c) the registered point processes 
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Figure 4.9 presents a superposition of the regularised Fréchet—Wasserstein esti- 
mators for 20 independent replications of the experiment, contrasted with a similar 
superposition for the naive estimator. The latter is clearly seen to be biased around 
the two peaks, while the regularised Fréchet—Wasserstein seems approximately un- 
biased, despite presenting fluctuations. It always captures the bimodal nature of the 
density, as is seen from the two clear elbows in each realisation. 7 

To illustrate the consistency of the regularised Fréchet—-Wasserstein estimator An 
for A as shown in Theorem 4.4.1, we let the number of processes n as well as the 
expected number of observed point per process T to vary. Figures 4.10 and 4.11 
show the sampling variation of Àn for different values of n and T. We observe that as 
either of these increases, the realisations An indeed approach À. The figures suggest 
that, in this scenario, the amplitude variation plays a stronger role than the phase 
variation, as the effect of T is more substantial. 
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Fig. 4.9: (a) Sampling variation of the regularised Fréchet-Wasserstein mean An 
and the true mean measure À for 20 independent replications of the experiment; 
(b) sampling variation of the arithmetic mean, and the true mean measure À for the 
same 20 replications; (c) superposition of (a) and (b). For ease of comparison, all 
three panels include residual curves centred at y = 3/4 


4.5.3 Effect of the Smoothing Parameter 


In order to work with measures of strictly positive density, the observed point pat- 
terns have been smoothed using a kernel function. This necessarily incurs an ad- 
ditional bias that depends on the bandwidth o;. The asymptotics (Theorem 4.4.1) 
guarantee the consistency of the estimators, in particular the regularised Fréchet- 
Wasserstein estimator An, provided that max’!_, 0; — 0. In our simulations, we 
choose oj in a data-driven way by employing unbiased cross validation. To gauge 
for the effect of the smoothing, we carry out the same estimation procedure but with 
o; multiplied by a parameter s. Figure 4.12 presents the distribution function of A, 
as a function of s. Interestingly, the curves are nearly identical as long as s < 1, 
whereas when s > 1, the bias becomes more substantial. 
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These findings are reaffirmed in Fig. 4.13 that show the registered point processes 
again as a function of s. We see that only minor differences are present as s varies 
from 0.1 to 1, for example, in the grey (8), black (17), and green (19) processes. 
When s = 3, the distortion becomes quite more substantial. This phenomenon re- 
peats itself across all combinations of n, T, and s tested. 
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Fig. 4.10: Sampling variation of the regularised Fréchet—Wasserstein mean Àn and 
the true mean measure A for 20 independent replications of the experiment, with € = 
0 and n = 30. Left: T = 43; middle: tT = 93; right: T = 143. For ease of comparison, 
all three panels include residual curves centred at y = 3/4 
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Fig. 4.11: Sampling variation of the regularised Fréchet—Wasserstein mean An and 
the true mean measure A for 20 independent replications of the experiment, with € = 
0 and T = 93. Left: n = 30; middle: n = 50; right: n = 70. For ease of comparison, 
all three panels include residual curves centred at y = 3/4. 


4.6 Convergence Rates and a Central Limit Theorem 
on the Real Line 


Since the conditional mean measures A; are discretely observed, the rate of conver- 
gence of our estimators will be affected by the rate at which the number of observed 
points per process NO) increases to infinity. The latter is controlled by the next 
lemma, which is valid for any complete separable metric space 2’. 
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Lemma 4.6.1 (Number of Points Grows Linearly) Let ni”) =i (X) denote 


l 
the total number of observed points. If Tn / logn — œ, then there exists a constant 


Cr > 0, depending only on the distribution of II, such that almost surely 


m n 
fac miny<i<n N 
lim inf ———>A— 
n= 


> Cr. 


n 
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Fig. 4.12: Regularised Fréchet—Wasserstein mean as a function of the smoothing 
parameter multiplier s, including residual curves. Here, n = 30 and T = 143 
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Fig. 4.13: Registered point processes as a function of the smoothing parameter mul- 
tiplier s. Left: s = 0.1; middle: s = 1; right: s = 3. Here, n = 30 and T = 43 


In particular, there are no empty point processes, so the normalisation is well- 
defined. If II is a Poisson process, then we have the more precise result 


. n 
: miny<i<n Ni ) 
lim —— == 


=1 almost surely. 
n= Tn 
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Remark 4.6.2 One can also show that the limit superior of the same quantity is 
bounded by a constant Ch. If Ta / logn is bounded below, then the same result holds 
but with worse constants. If only Tn, — °°, then the result holds for each i separately 
but in probability. 


The proof is a simple application of Chernoff bounds; see page 108 in the supple- 
ment. 

With Lemma 4.6.1 under our belt, we can replace terms of the order min; ni”) by 
the more transparent order T„. As in the consistency proof, the idea is to write 


F —F, = (F — Fa) + (Fa — È) + (Fn — Rr) 


and control each term separately. The first term corresponds to the phase variation, 
and comes from the approximation of the theoretical expectation F by a sample 
mean F,,. The second term is associated with the amplitude variation resulting from 
observing A; discretely. The third term can be viewed as the bias incurred by the 
smoothing procedure. Accordingly, the rate at which Àn converges to À is a sum 
of three separate terms. We recall the standard Op terminology: if X, and Y, are 
random variables, then X,, = Op (Y„) means that the sequence (X,,/Y,) is bounded in 
probability, which by definition is the condition 
ye >0 3M: supP ( Xn 
n Y, 


n 


>m) <E. 


Instead of X, = Op (Y, ), we will sometimes write Y, > Op(Xn). The former notation 
emphasises the condition that X, grows no faster than Y„, while the latter stresses 
that Y„ grows at least as fast as X, (which is of course the same assertion). Finally, 
Xn = op(Y,) means that X,,/Y,, — 0 in probability. 


Theorem 4.6.3 (Convergence Rates on R) Suppose in addition to Assumptions 3 
that d = 1, t/logn —> œ and that TI is either a Poisson process or a binomial 
process. Then 


` 1 1 T ta 
Wo (àn, A) < Op (=z) +0 (se) +0), On =— 0; ) 
n i=1 


where all the constants in the Op terms are explicit. 


Remark 4.6.4 Unlike classical density estimation, no assumptions on the rate of 
decay of On are required, because we only need to estimate the distribution function 
and not the derivative. If the smoothing parameter is chosen to be o = nje 
for some a > 0 and %,/logn — œ, then by Lemma 4.6.1 On < MaX1<i<n ot = 
Op(t,, “). For example, if Rosenblatt’s rule œ = 1/5 is employed, then the Op(On) 
term can be replaced by Op(1/¥/T). 


One can think about the parameter T as separating the sparse and dense regimes as in 
classical functional data analysis (see also Wu et al. [132]). If t is bounded, then the 
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setting is ultra sparse and consistency cannot be achieved. A sparse regime can be 
defined as the case where T, — œ but slower than logn. In that case, consistency is 
guaranteed, but some point patterns will be empty. The dense regime can be defined 
as Tn >> n’, in which case the amplitude variation is negligible asymptotically when 
compared with the phase variation. 

The exponent — 1/4 of T, can be shown to be optimal without further assump- 
tions, but it can be improved to —1/2 if P(f, > eon K) = 1 for some € > 0, 
where f, is the density of A (see Sect. 4.7). In terms of T, the condition is that 
P(T’ > €) = 1 for some £ and A has a density bounded below. When this is the 
case, T, needs to compared with n rather than n? in the next paragraph and the next 
theorem. 

Theorem 4.6.3 provides conditions for the optimal parametric rate „y/n to be 
achieved: this hie if we set O, to be of the order Op(n = 2) or less and if Tp is 
of the order n? or more. But if the last two terms in Theorem 4.6. 3 are negligible 


-1/2 


with respect to n then a sort of central limit theorem holds for An: 


Theorem 4.6.5 (Asymptotic Normality) In addition to the conditions of Theorem 
4.6.3, assume that Ta /n? — œ, Gn, = op(n™!/?) and A possesses an invertible distri- 
bution function F, on K. Then 


yn (r — i) —>Z weakly in L2(À), 


for a zero-mean Gaussian process Z with the same covariance operator of T (the 
latter viewed as a random element in L2 (à )), namely with covariance kernel 


K(x,y) = cov{7 (x), T(y)}. 


If the density f}, exists and is (piecewise) continuous and bounded below on K, then 
the weak convergence also holds in L2(K). 


In view of Sect. 2.3, Theorem 4.6.5 can be interpreted as asymptotic normality of 
An in the tangential sense: y/n log, (A, n) converges to a Gaussian random element 
in the tangent space Tan}, which is a subset of L2(A). The additional smoothness 
conditions allow to switch to the space L2(K), which is independent of the unknown 
template measure À. 

See pages 109 and 110 in the supplement for detailed proofs of these theorems. 
Below we sketch the main ideas only. 


Proof (Proof of Theorem 4.6.3). The quantile formula W2(y, 0) = ||Fo | -F,! llz20,1) 
from Sect. 1.5 and the average quane formula for the Fréchet mean (Sect. 3.1.4) 
show that the oracle empirical mean Fy- follows a central limit theorem in L2(0, 1). 
Since we work in the Hilbert space L (0,1), Fréchet means are simple averages, 
so the errors in the Fréchet mean have the same rate as the errors in the Fréchet 
functionals. The smoothing term is easily controlled by Lemma 4.4.2. 

Controlling the amplitude term is more difficult. Bounds can be given using the 
machinery sketched in Sect. 4.7, but we give a more elementary proof by reducing 
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to the 1-Wasserstein case (using (2.2)), which can be more easily handled in terms 
of distributions functions (Corollary 1.5.3). 


Proof (Proof of Theorem 4.6.5). The hypotheses guarantee that the amplitude and 
smoothing errors are negligible and 


va(e! -F,") — GP weakly in L2(0,1), 


where GP is the Gaussian process defined in the proof of Theorem 4.6.3. One then 
employs a composition with Fy. 


4.7 Convergence of the Empirical Measure and Optimality 


One may find the term Op(1/</t,) in Theorem 4.6.3 to be somewhat surprising, 
and expect that it ought to be Op(1/,/t,). The goal of this section is to show why 
the rate 1/</T, is optimal without further assumptions and discuss conditions under 
which it can be improved to the optimal rate 1/,/T,. For simplicity, we concentrate 
on the case Tt, = n and assume that the point process IT is binomial; the Poisson 
case being easily obtained from the simplified one (using Lemma 4.6.1). We are 
thus led to study rates of convergence of empirical measures in the Wasserstein 
space. That is to say, for a fixed exponent p > 1 and a fixed measure u E (X), 
we consider independent random variables X,... with law u and the empirical 
measure Un =n |S", 6{X;}. The first observation is that EW, (u, Un) — 0: 


Lemma 4.7.1 Let u € P(X) be any measure. Then 


=0 UEW,(2) 
(H mT LEXIZ). 


Proof. This result has been established in an almost sure sense in Proposition 2.2.6. 
To extend to convergence in expectation observe that 


D 1 p 
WEU) f -yl an omly) = 7D | -XPa 


Thus, the random variable 0 < Y, = WẸ (u, Hn) is bounded by the sample average 
Zn of a random variable V = fọ ||x—Xı ||? du(x) that has a finite expectation. A 
version of the dominated converge theorem (given on page 111 in the supplement) 
implies that EY,, — 0. Now invoke Jensen’s inequality. 


Remark 4.7.2 The sequence EW,(U, Un) is not monotone, as the simple example 
= (69 + 61) /2 shows (see page 111 in the supplement). 


The next question is how quickly EW,(u, Hn) vanishes when u E€ Y,(.2°). We shall 
begin with two simple general lower bounds, then discuss upper bounds in the one- 
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dimensional case, put them in the context of Theorem 4.6.3, and finally briefly touch 
the d-dimensional case. 


Lemma 4.7.3 (vn Lower Bound) Let u € P(X) be nondegenerate. Then there 
exists a constant c(u) > 0 such that for all p > 1 and alln 


c(u) 
vn 


Proof. Let X ~ u and let a Æ b be two points in the support u. Consider f(x) = 
min(1, ||x—al|), a bounded 1-Lipschitz function such that f(a) = 0 < f(b). Then 


Wp (Un, U) > 


2var f (X) 
T 


>0 


= 


ViEW, (Un, U) > VnEW (Mn, u) > E |n! Ñ f(X) —Ef(X)| > 
i=1 


by the central limit theorem and the Kantorovich—Rubinstein theorem (1.11). 


For discrete measures, the rates scale badly with p. More generally: 


Lemma 4.7.4 (Separated Support) Suppose that there exist Borel sets A,B C X 
such that U(AUB) = 1, 


u(A)u(B) >0 and dmin = _inf pll >0. 


xEA,yE 


Then for any p > 1 there exists cp() > 0 such that EW, (Un, 4) > cp()n CP. 


Any nondegenerate finitely discrete measure u satisfies this condition, and so do 
“non-pathological” countably discrete ones. (An example of a “pathological” mea- 
sure is one assigning positive mass to any rational number.) 


Proof. Let k ~ B(n,q = u(A)) denote the number of points from the sample 
(X1,..-,Xn) that fall in A. Then a mass of |k/n — q| must travel between A and B, 
a distance of at least dmin. Thus, W$ (un, 4) > d? n \k/ n— q|, and the result follows 
from the central limit theorem for k; see page 112 in the supplement for the full 
details. 


These lower bounds are valid on any separable metric space. On the real line, it is 
easy to obtain a sufficient condition for the optimal rate n—'/2 to be attained for 
W: since F, (t) ~ B(n, F(t)) has variance F(t)(1 — F(t))/n, we have (by Fubini’s 
theorem and Jensen’s inequality) 


Wi (na) = f EIRO -FOA <n f VFO FO)u, 


so that W; (Un, LL) is of the optimal order n~!/? if 


J(u) := [vFo FO) dt < ee, 
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Since the integrand is bounded by 1/2, this is certainly satisfied if u is compactly 
supported. The J; condition is essentially a moment condition, since for any 6 > 0, 
we have for X ~ u that E|X|?+? < œ => Ji (u) < œ => E|X/? <o. It turns out that 
this condition is necessary, and has a more subtle counterpart for any p > 1. Let f 
denote the density of the absolutely continuous part of u (so f = 0 if u is discrete). 


Theorem 4.7.5 (Rate of Convergence of Empirical Measures) Let p > land u € 
W,(R). The condition 


_ /2 
J(u) = [PO dt<eo,  (0°=1) 


is necessary and sufficient for EWp(Mn, u) = O(n-"/), 


See Bobkov and Ledoux [25, Theorem 5.10] for a proof for the J, condition, and 
Theorems 5.1 and 5.3 for the values of the constants and a stronger result. 

When p > 1, for J (U) to be finite, the support of u must be connected; this is not 
needed when p = 1. Moreover, the J, condition is satisfied when f is bounded below 
(in which case the support of u must be compact). However, smoothness alone does 
not suffice, even for measures with positive density on a compact support. More 
precisely, we have: 


Proposition 4.7.6 For any rate €n — 0 there exists a measure u on |—1,1] with 
positive C” density there, and such that for all n 


EW (Hn, HH) = Chp, u)" OP ep. 


The rate n~!/(P) from Lemma 4.7.4 is the worst among compactly supported mea- 
sures on R. Indeed, by Jensen’s inequality and (2.2), for any u € P((0, 1]), 


[Wp (Hn, H) < [EW? (un, 1]? < [EW (un, W)]!/? < n7" eP., 


The proof of Proposition 4.7.6 is done by “smoothing” the construction in 
Lemma 4.7.4, and is given on page 113 in the supplement. 
Let us now put this in the context of Theorem 4.6.3. In the binomial case, since 


each Wi”) and each A; are independent, we have 


~ —— 1 

ny ~A-\IA: < : 

W2(Ai,Ai)|Ai < 22 (Ai) T 
(In the Poisson case, we need to condition on NO) and then estimate its inverse 
square root as is done in the proof of Theorem 4.6.3.) Therefore, a sufficient con- 
dition for the rate 1/4/Tn to hold is that E\/J2(A) < œ and a necessary condition 
is that P(,/J2(A) < œ) = 1. These hold if there exists 6 > 0 such that with prob- 
ability one A has a density bounded below by 6. Since A = T#A, this will happen 
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provided that A itself has a bounded below density and T has a bounded below 
derivative. Bigot et al. [23] show that the rate ,/T,, cannot be improved. 

We conclude by proving a lower bound for absolutely continuous measures and 
stating, without proof, an upper bound. 


Proposition 4.7.7 Let u € %4 (R?) have an absolutely continuous part with respect 
to Lebesgue measure, and let v, be any discrete measure supported on n points (or 
less). Then there exists a constant C(u) > 0 such that 


Wo (Vn) > Wi (H, Vn) > C(w)n'/4. 


Proof. Let f be the density of the absolutely continuous part Uc, and observe that 
for some finite number M, 


28 = ue({x: f(x) < M}) > 0. 


Let x1,...,Xn be the support points of v, and € > 0. Let Uem be the restriction of 
Ue to the set where the density is smaller than M. The union of balls Bg(x;) has 
Uc.m-Measure of at most 


n 
M ¥ Leb(Be(x;)) = Mne“Lebg(B; (0)) = Mne“Cy = ô, 
i=] 


ife! = ô(nMCay !. Thus, a mass 26 — 6 = 6 must travel more than € from V, to u 
in order to cover Ue m. Hence 


Wi (Vn, H) > be = 8(6/MCq)'/4 n". 


The lower bound holds because we need £7% balls of radius € in order to cover 
a sufficiently large fraction of the mass of u. The determining quantity for upper 
bounds on the empirical Wasserstein distance is the covering numbers 


N(u,€,T) = minimal number of balls whose union has u mass > 1 —T. 


Since u is tight, these are finite for all €, tT > 0, and they increase as € and T approach 
zero. To put the following bound in context, notice that if u is compactly supported 
on R4, then N(u,£,0) < Ke~’. 


Theorem 4.7.8 If for some d>2p, N(u,€,€4?/(4-??))<Ke~4, then EW, <C,n t/i. 


Comparing this with the lower bound in Lemma 4.7.4, we see that in the high- 
dimensional regime d > 2p, absolutely continuous measures have a worse rate than 
discrete ones. In the low-dimensional regime d < 2p, the situation is opposite. We 
also obtain that for d > 2 and a compactly supported absolutely continuous u € 
W (RI), EW, (Hn, 4) ~ 0-4, 
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4.8 Bibliographical Notes 


Our exposition in this chapter closely follows the papers Panaretos and Zemel [100] 
and Zemel and Panaretos [134]. 

Books on functional data analysis include Ramsay and Silverman [109, 110], 
Ferraty and Vieu [51], Horvath and Kokoszka [70], and Hsing and Eubank [71], and 
a recent review is also available (Wang et al. [127]). The specific topic of amplitude 
and phase variation is discussed in [110, Chapter 7] and [127, Section 5.2]. The next 
paragraph gives some selective references. 

One of the first functional registration techniques employed dynamic program- 
ming (Wang and Gasser [128]) and dates back to Sakoe and Chiba [118]. Landmark 
registration consists of identifying salient features for each curve, called landmarks, 
and aligning them (Gasser and Kneip [61]; Gervini and Gasser [63]). In pairwise 
synchronisation (Tang and Miiller [122]) one aligns each pair of curves and then 
derives an estimator of the warp functions by linear averaging of the pairwise reg- 
istration maps. Another class of methods involves a template curve, to which each 
observation is registered, minimising a discrepancy criterion; the template is then 
iteratively updated (Wang and Gasser [129]; Ramsay and Li [108]). James [72] de- 
fines a “feature function” for each curve and uses the moments of the feature func- 
tion to guarantee identifiability. Elastic registration employs the Fisher—Rao metric 
that is invariant to warpings and calculates averages in the resulting quotient space 
(Tucker et al. [123]). Other techniques include semiparametric modelling (Rønn 
[115]; Gervini and Gasser [64]) and principal components registration (Kneip and 
Ramsay [82]). More details can be found in the review article by Marron et al. [90]. 
Wrobel et al. [131] have recently developed a registration method for functional 
data with a discrete flavour. It is also noteworthy that a version of the Wasserstein 
metric can also be used in the functional case (Chakraborty and Panaretos [34]). 

The literature on the point processes case is more scarce; see the review by Wu 
and Srivastava [133]. 

A parametric version of Theorem 4.2.4 was first established by Bigot and 
Klein [22, Theorem 5.1] in Rf, extended to a compact nonparametric formulation 
in Zemel and Panaretos [134]. There is an infinite-dimensional linear version in 
Masarotto et al. [91]. The current level of generality appears to be new. 

Theorem 4.4.1 is a stronger version of Panaretos and Zemel [100, Theorem 1] 
where it was assumed that T, must diverge to infinity faster than logn. An analo- 
gous construction under the Bayesian paradigm can be found in Galasso et al. [58]. 
Optimality of the rates of convergence in Theorem 4.6.3 is discussed in detail by 
Bigot et al. [23], where finiteness of the functional Jz (see Sect. 4.7) is assumed and 
consequently Op(Tn 1/ t is improved to Op(Tn 1/ 3, 

As far as we know, Theorem 4.6.5 (taken from [100]) is the first central limit the- 
orem for Fréchet means in Wasserstein space. When the measures A; are observed 
exactly (no amplitude variation: tT, = œ and o = 0) Kroshnin et al. [84] have re- 
cently proven a central limit theorem for random Gaussian measures in arbitrary 
dimension, extending a previous result of Agueh and Carlier [3]. It seems likely that 
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in a fully nonparametric setting, the rates of convergence (compare Theorem 4.6.3) 
might be slower than „y/n; see Ahidar-Coutrix et al. [4]. 

The magnitude of the amplitude variation in Theorem 4.6.3 pertains to the rates 
of convergence of EW, (Un, L) to zero (Sect. 4.7). This is a topic of intense research, 
dating back to the seminal paper by Dudley [46], where a version of Theorem 4.7.8 
with p = 1 is shown for the bounded Lipschitz metric. The lower bounds proven in 
this section were adapted from [46], Fournier and Guillin [54], and Weed and Bach 
[130]. 

The version of Theorem 4.7.8 given here can be found in [130] and extends 
Boissard and Le Gouic [27]. Both papers [27, 130] work in a general setting of com- 
plete separable metric spaces. An additional logn term appears in the limiting case 
d = 2p, as already noted (for p = 1) by [46], and the classical work of Ajtai et al. [5] 
for u uniform on (0, 1]?. More general results are available in [54]. A longer (but far 
from being complete) bibliography is given in the recent review by Panaretos and 
Zemel [101, Subsection 3.3.1], including works by Barthe, Dobrić, Talagrand, and 
coauthors on almost sure results and deviation bounds for the empirical Wasserstein 
distance. 

The J; condition is due to del Barrio et al. [43], who showed it to be necessary 
and sufficient for the empirical process \/n(F, — F) to converge in distribution to 
Bo F, with B Brownian bridge. The extension to 1 < p < œ (and a lot more) can be 
found in Bobkov and Ledoux [25], employing order statistics and beta distributions 
to reduce to the uniform case. Alternatively, one may consult Mason [92], who uses 
weighted approximations to Brownian bridges. 

An important aspect that was not covered here is that of statistical inference of 
the Wasserstein distance on the basis of the empirical measure. This is a challeng- 
ing question and results by del Barrio, Munk, and coauthors are available for one- 
dimensional, elliptical, or discrete measures, as explained in [101, Section 3]. 
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Chapter 5 A 
Construction of Fréchet Means and gea 
Multicouplings 


When given measures u!,..., uN are supported on the real line, computing their 
Fréchet mean ĀŪ is straightforward (Sect. 3.1.4). This is in contrast to the multivari- 
ate case, where, apart from the important yet special case of compatible measures, 
closed-form formulae are not available. This chapter presents an iterative procedure 
that provably approximates at least a Karcher mean with mild restrictions on the 
measures U!,..., u^. The algorithm is based on the differentiability properties of 
the Fréchet functional developed in Sect.3.1.6 and can be interpreted as classical 
steepest descent in the Wasserstein space #4 (R°). It reduces the problem of finding 
the Fréchet mean to a succession of pairwise transport problems, involving only the 
Monge-Kantorovich problem between two measures. In the Gaussian case (or any 
location-scatter family), the latter can be done explicitly, rendering the algorithm 
particularly appealing (see Sect. 5.4.1). 

This chapter can be seen as a complementary to Chap. 4. On the one hand, one 
can use the proposed algorithm to construct the regularised Fréchet—Wasserstein 
estimator An that approximates a population version (see Sect. 4.3). On the other 
hand, it could be that the object of interest is the sample y!,..., itself, but that 
the latter is observed with some amount of noise. If one only has access to proxies 


u!,...,%, then it is natural to use their Fréchet mean il as an estimator of fl. 
The proposed algorithm can then be used, in principle, in order to construct ĀŪ, and 
the consistency framework of Sect. 4.4 then allows to conclude that if each u’ is 
consistent, then so is ii. 

After presenting the algorithm in Sect.5.1, we make some connections to Pro- 
crustes analysis in Sect.5.2. A convergence analysis of the algorithm is carried out 
in Sect. 5.3, after which examples are given in Sect. 5.4. An extension to infinitely 
many measures is sketched in Sect. 5.5. 
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5.1 A Steepest Descent Algorithm for the Computation 
of Fréchet Means 


Throughout this section, we assume that N is a fixed integer and consider a fixed 
collection 


u',..., u” € (R!) with u! absolutely continuous with bounded density, 
(5.1) 


whose unique (Proposition 3.1.8) Fréchet mean Ū is sought. It has been established 
that if y is absolutely continuous then the associated Fréchet functional 


F(Y) = 55 WHY), yE W(R), 
has Fréchet derivative (Theorem 3.1.14) 
N 
m2 (t —i) € Tan, (5.2) 


F'( = P log,(u') = 


at y. Let yj € #2 (R) be an absolutely continuous measure, representing our current 
estimate of the Fréchet mean at step j. Then it makes sense to introduce a step size 
Tj > 0, and to follow the steepest descent of F given by the negative of the gradient: 


Yit =exp,, (—T/F'(%)) = 


N N 
i+ yi > oe) #yj= Jit uy. dt -»] #Y;. 
i=l 
In order to employ further descent at y;+1, it needs to be verified that F is differen- 
tiable at y;,1, which amounts to showing that the latter stays absolutely continuous. 
This will happen for all but countably many values of the step size Tj, but necessarily 
if the latter is contained in (0, 1]: 


Lemma 5.1.1 (Regularity of the Iterates) Jf % is absolutely continuous and T = 
To € [0,1], then yı = exp, (— TF" (%)) is also absolutely continuous. 


The idea is that push-forwards of j under monotone maps are absolutely continuous 
if and only if the monotonicity is strict, a property preserved by averaging. See page 
118 in the supplement for the details. 

Lemma 5.1.1 suggests that the step size should be restricted to [0, 1]. The next re- 
sult suggests that the objective function essentially tells us that the optimal step size, 
achieving the maximal reduction of the objective function (thus corresponding to an 
approximate line search), is exactly equal to 1. It does not rely on finite-dimensional 
arguments and holds when replacing R@ by a separable Hilbert space. 
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Lemma 5.1.2 (Optimal Stepsize) If % € %4 (R?) is absolutely continuous, then 


2 
F(n)—F(y%) < -IIF’(%)|I? k- z| 


and the bound on the right-hand side of the last display is minimised when T = 1. 


Proof. Let Si = e be the optimal map from % to u’, and set W; = S; — i. Then 


N N 
2NF (1p) -5w wud =¥ flSin E Wany 63 


Both yı and u‘ can be written as push-forwards of % and (2.3) gives the bound 


wenn’) < [ 


For brevity, we omit the subscript (7) from the norms and inner products. De- 


2 


o Oe 
fa me ESS) Si 


an= -ms tw 


j=! 


Rd £7 (%) 


veloping the squares, summing over i = 1,...,N and using (5.3) gives 
N N ] 5 

2NF (1) <$ wi? - 2% Pa (W;,Wj) +N? Dai 

7 ia i 2 A 2 
= 2NF (yo) —2NT 2p" +N? YF -Will , 


and recalling that W; = S; — i yields 


vom 


2 
F(n)—F(p) < © an (w)? k-75) 


To conclude, observe that T — T? /2 is maximised at T = 1. 


In light of Lemmata 5.1.1 and 5.1.2, we will always take T; = 1. The resulting 
iteration is summarised as Algorithm 1. A first step in the convergence analysis is 
that the sequence (F (y;)) is nonincreasing and that for any integer k, 


k k 

5 SII? < È FO) -EF in) =F) -F e) < FW) 
i= j= 

P 


As k — ©, the infinite sum on the left-hand side converges, so || F’ (y;)||/ must vanish 


as j —> œ. 
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Remark 5.1.3 The proof of Proposition 3.1.2 suggests a generalisation of Algo- 
rithm 1 to arbitrary measures in W(R¢) even if none are absolutely continuous. 
One can verify that Lemmata 5.1.2 and 5.3.5 (below) also hold in this setup, so it 
may be that convergence results also apply in this setup. The iteration no longer has 
the interpretation as steepest descent, however. 


Algorithm 1 Steepest descent via Procrustes analysis 


(A) Set a tolerance threshold € > 0. 


(B) For j =0, let y; be an arbitrary absolutely continuous measure. 
(C) For i=1,...,N solve the (pairwise) Monge problem and find the optimal transport map E 


from y; to u’. 


(D) Define the map T; = N`! yi t, 
(Œ) Set Yj+ı = Tj#7;, i.e. push-forward y; via T; to obtain yj+1- 


(F) If ||F’(y;+1)|| < £, stop, and output y;,1 as the approximation of fi and tE as the approxi- 
Vi p put Y; pp u Yj+1 Pp. 


mation of ti ,i=1,...,N. Otherwise, return to step (C). 


5.2 Analogy with Procrustes Analysis 


Algorithm 1 is similar in spirit to another procedure, generalised Procrustes anal- 
ysis, that is used in shape theory. Given a subset B C Rf, most commonly a finite 
collection of labelled points called landmarks, an interesting question is how to 
mathematically define the shape of B. One way to reach such a definition is to dis- 
regard those properties of B that are deemed irrelevant for what one considers this 
shape should be; typically, these would include its location, its orientation, and/or its 
scale. Accordingly, the shape of B can be defined as the equivalence class consist- 
ing of all sets obtained as gB, where g belongs to a collection Y of transformations 
of RI containing all combinations of rotations, translations, dilations, and/or reflec- 
tions (Dryden and Mardia [45, Chapter 4]). 

If Bı and Bz are two collections of k landmarks, one may define the distance 
between their shapes as the infimum of ||B, — gB>||? over the group Z. In other 
words, one seeks to register Bz as close as possible to By by using elements of the 
group ¥Y, with distance being measured as the sum of squared Euclidean distances 
between the transformed points of Bz and those of B,. In a sense, one can think about 
the shape problem and the Monge problem as dual to each other. In the former, one 
is given constraints on how to optimally carry out the registration of the points with 
the cost being judged by how successful the registration procedure is. In the latter, 
one imposes that the registration be done exactly, and evaluates the cost by how 
much the space must be deformed in order to achieve this. 

The optimal g and the resulting distance can be found in closed-form by means 
of ordinary Procrustes analysis [45, Section 5.2]. Suppose now that we are given 
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N > 2 collections of points, By,...,By, with the goal of minimising the sum of 
squares ||¢;B; — g;B;||? over g; € Y.' As in the case of Fréchet means in %4 (R?) 
(Sect. 3.1.2), there is a formulation in terms of sum of squares from the average 
NXg jB j. Unfortunately, there is no explicit solution for this problem when d > 3. 
Like Algorithm 1, generalised Procrustes analysis (Gower [66]; Dryden and Mardia 
[45, p. 90]) tackles this “multimatching” setting by iteratively solving the pairwise 
problem as follows. Choose one of the configurations as an initial estimate/template, 
then register every other configuration to the template, employing ordinary Pro- 
crustes analysis. The new template is then given by the linear average of the regis- 
tered configurations, and the process is iterated subsequently. 

Paralleling this framework, Algorithm | iterates the two steps of registration and 
linear averaging given the current template y;, but in a different manner: 


(1) Registration: by finding the optimal transportation maps th, we identify each 


u’ with the element vA —i=log,, (u'). In this sense, the collection (u!,...,u) 
is viewed in the common coordinate system given by the tangent space at the 
template y; and is registered to it. 

(2) Averaging: the registered measures are averaged linearly, using the common 
coordinate system of the registration step (1), as elements in the linear space 
Tany,. The linear average is then retracted back onto the Wasserstein space via 
the exponential map to yield the estimate at the (j + 1)-th step, y;+1. 


Notice that in the Procrustes sense, the maps that register each u’ to the template 
Yj are One the inverses of ty, We will not use the term “registration maps” in the 
sequel, to avoid possible confusion. 


5.3 Convergence of Algorithm 1 


In order to tackle the issue of convergence, we will use an approach that is specific 
to the nature of optimal transport. This is because the Hessian-type arguments that 
are used to prove similar convergence results for steepest descent on Riemannian 
manifolds (Afsari et al. [1]) or Procrustes algorithms (Le [86]; Groisser [67]) do not 
apply here, since the Fréchet functional may very well fail to be twice differentiable. 

In fact, even in Euclidean spaces, convergence of steepest descent usually re- 
quires a Lipschitz bound on the derivative of F (Bertsekas [19, Subsection 1.2.2]). 
Unfortunately, F is not known to be differentiable at discrete measures, and these 
constitute a dense set in W; consequently, this Lipschitz condition is very unlikely 
to hold. Still, this specific geometry of the Wasserstein space affords some advan- 
tages; for instance, we will place no restriction on the starting point for the iteration, 
except that it be absolutely continuous; and no assumption on how “spread out” the 
collection u!,...,u is will be necessary as in, for example, [1, 67, 86]. 


' One needs to add an additional constraint to prevent registering all the collections to the origin. 
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Theorem 5.3.1 (Limit Points are Karcher Means) Let u',...,u" € W%(R“) be 
probability measures and suppose that one of them is absolutely continuous with 
a bounded density. Then, the sequence generated by Algorithm I stays in a com- 
pact set of the Wasserstein space W(R¢), and any limit point of the sequence is a 
Karcher mean of (u',..., UW). 


Since the Fréchet mean [i is a Karcher mean (Proposition 3.1.8), we obtain imme- 
diately: 


Corollary 5.3.2 (Wasserstein Convergence of Steepest Descent) Under the con- 
ditions of Theorem 5.3.1, if F has a unique stationary point, then the sequence {y;} 
generated by Algorithm 1 converges to the Fréchet mean of {u',...,u%} in the 
Wasserstein metric, 


Wa(y¥j,)—0, j>». 


Alternatively, combining Theorem 5.3.1 with the optimality criterion Theorem 
3.1.15 shows that the algorithm converges to ff when the appropriate assumptions 
on {u'} and the Karcher mean u = lim yj are satisfied. This allows to conclude that 
Algorithm 1 converges to the unique Fréchet mean when u’ are Gaussian measures 
(see Theorem 5.4.1). 

The proof of Theorem 5.3.1 is rather elaborate, since we need to use specific 
methods that are tailored to the Wasserstein space. Before giving the proof, we state 
two important consequences. The first is the uniform convergence of the optimal 
maps A to A on compacta. This convergence does not immediately follow from 
the Wasserstein convergence of y; to 4, and is also established for the inverses. Both 
the formulation and the proof of this result are similar to those of Theorem 4.4.3. 


Theorem 5.3.3 (Uniform Convergence of Optimal Maps) Under the conditions 
of Corollary 5.3.2, there exist sets A,B!,...,BN C RÌ such that Ā(A) = 1 = u! (B!) = 


--» = u^ (BY) and 
wi i|| jee % _ 4h || ee .— 
sup |e; ta | =o, sup tii ti — 0, i=1,...,N, 


Q; 
for any pair of compacta Q; CA, aż C BË. If in addition all the measures u! ,..., u” 
have the same support, then one can choose all the sets B' to be the same. 
The other consequence is convergence of the optimal multicouplings. 


Corollary 5.3.4 (Convergence of Multicouplings) Under the conditions of Corol- 
lary 5.3.2, the sequence of multicouplings 


1 n 
u u ; 
(ti, preety, ay 
of {u',...,u%} converges (in Wasserstein distance on (IR“) ) to the optimal multi- 


i u! u” cee 
coupling (tr petr \#[. 


The proofs of Theorem 5.3.3 and Corollary 5.3.4 are given at the end of the present 
section. 


5.3 Convergence of Algorithm 1 123 


The proof of Theorem 5.3.1 is achieved by establishing the following facts: 


1. The sequence (y;) stays in a compact subset of YW (R?) (Lemma 5.3.5). 

2. Any limit of (yj) is absolutely continuous (Proposition 5.3.6 and the paragraph 
preceding it). 

3. Algorithm 1 acts continuously on its argument (Corollary 5.3.8). 


Since it has already been established that ||F’(y;)|| — 0, these three facts indeed 
suffice. 


Lemma 5.3.5 The sequence generated by Algorithm I stays in a compact subset of 
the Wasserstein space W(R¢). 


Proof. For all j > 1, yj takes the form M,#, where My(x1,...,xv) =X and 7 is 
a multicoupling of u!,..., u^. The compactness of this set has been established in 
Step 2 of the proof of Theorem 3.1.5; see page 63 in the supplement, where this is 
done in a more complicated setup. 


A closer look at the proof reveals that a more general result holds true. Let </ 
denote the steepest descent iteration, that is, 2% (y;) = Yj+1. Then the image of &, 
{AU : u E€ W2(R¢) absolutely continuous} has a compact closure in W(R“). This 
is also true if Rf is replaced by a separable Hilbert space. 

In order to show that a weakly convergent sequence (y;) of absolutely continuous 
measures has an absolutely continuous limit y, it suffices to show that the densities 
of y; are uniformly bounded. Indeed, if C is such a bound, then for any open O C Rf, 
liminf%(0) < CLeb(O), so y(O) < CLeb(O) by the portmanteau Lemma 1.7.1. It 
follows that y is absolutely continuous with density bounded by C. We now show 
that such C can be found that applies to all measures in the image of æ, hence to all 
sequences resulting from iterations of Algorithm 1. 


Proposition 5.3.6 (Uniform Density Bound) For each i = 1,...,N denote by gi 
the density of u' (if it exists) and |\g'\|.. its supremum, taken to be infinite if g' does 
not exist (or if gi is unbounded). Let % be any absolutely continuous probability 
measure. Then the density of y) = &(y) is bounded by the 1/d-th harmonic mean 


of lle! 
i% 1 l“ 
Ea 

i P a 


cos 


The constant C, depends only on the measures (u',...,u), and is finite as long as 
one u’ has a bounded density, since Cy < N“||g'||.. for any i. 


Proof. Let hi be the density of y;. By the change of variables formula, for y-almost 
any x 


ho(x) ight! ho(x) po 
h(t% (%)) = ; g (th, (x)) = ——_,, when g’ exists. 
r det Vt (x) i detVtË (x) 
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(Convex functions are twice differentiable almost surely (Villani [125, Theorem 
14.25]), hence these gradients are well-defined y-almost surely.) We seek a lower 
bound on the determinant of Vt% (x), which by definition equals 


N ; 
N~“ det $ Vth (x) 
i=l 


Such a bound is provided by the Brunn—Minkowski inequality (Stein and Shakarchi 
[121, Section 1.5]) for symmetric positive semidefinite matrices 


[det(A + B)]!/4 > [det A]!/4 + [det B]!/4, 


which, applied inductively, yields 


N d 
[det Vt} (x Je 2 [act Wel, (x )| a 


ae 


From this, we obtain an upper bound for h: 


1 det!/4y% | Vee (x 


thot ya 
nO) NG) 


N 
SP ier o lee 


Let X be the set of points where this inequality holds, then ~(2) = 1. Hence 
n (tp (Z)) = Plte) (tH) = (2) =1. 
Thus, y,-almost surely and for all i, 
hı (y) < Cu. 


The third statement (continuity of 2) is much more subtle to establish, and its rather 
lengthy proof is given next. In view of Proposition 5.3.6, the uniform bound on the 
densities is not a hindrance for the proof of convergence of Algorithm 1. 


Proposition 5.3.7 Let (y,) be a sequence of absolutely continuous measures with 
uniformly bounded densities, suppose that W2(Yn, Y) —> 0, and let 


1 n 1 n 
nj = (thy tH oi) AY, n=(t} E si) #7, 
Then nj >N in Pa (RIAH). 


Proof. As has been established in the discussion before Proposition 5.3.6, the limit 
y must be absolutely continuous, so 17) is well-defined. 

In view of Theorem 2.2.1, it suffices to show that if h : (R“)N+! — R is any 
continuous nonnegative function such that 
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= 


2 
lati, sty) S ay 2 lal? + 2llyl, 


i=l 


then 
= _ preh’ u” 
f n= | +f pope = [89% s= (2), ---5 (2).2), 


and g defined analogously. The proof, given in full detail on page 124 of the supple- 
ment, is sketched here. 

Step 1: Truncation. Since y, converge in the Wasserstein space, they satisfy the 
uniform integrability (2.4) and absolute continuity (2.7) by Theorem 2.2.1. Conse- 
quently, gar = min(g,,4R) is uniformly close to gy: 


sup fien) sardio. Rx, 


We may thus replace g, by a bounded version gn,R. 
: ap : u’ 
Step 2: Convergence of gn to g. By Proposition 1.7.11, the optimal maps ty, 


converge to t and (since h is continuous), g, — g uniformly on “nice” sets Q C 
E =suppy. Write 


[ans dy, - [ seay= fer d(m-— Y) +f (8n, R — 8R) d% +f (8n,R — 8R) dh- 
Q RAQ 


Step 3: Bounding the first two integrals. The first integral vanishes as n — ©, 
by the portmanteau Lemma 1.7.1, and the second by uniform convergence. 

Step 4: Bounding the third integral. The integrand is bounded by 8R, so it suf- 
fices to bound the measures of R? \ Q. This is a bit technical, and uses the uniform 
density bound on (y,,) and the portmanteau lemma. 


Corollary 5.3.8 (Continuity of 7) [fW2(%,,7) > 0 and Y% have uniformly bounded 
densities, then % (Y) > L(Y). 


Proof. Choose h in the proof of Proposition 5.3.7 to depend only on y. 


Proof (Proof of Corollary 5.3.4). Choose h in the proof of Proposition 5.3.7 to de- 
pend only on f),...,fp. 


Proof (Proof of Theorem 5.3.3). Let E = suppfi and set Ai = E” N {x : i (x) is 
univalued}. As fi is absolutely continuous, fi(A') = 1, and the same is true for A = 
gna A! . The first assertion then follows from Proposition 1.7.11. 

The second statement is proven similarly. Let E’ = suppu’ and notice that by 
absolute continuity the Bi = (EŻ) A {x: ti ;(x) is univalued} has measure 1 with 
respect to u’. Apply Proposition 1.7.11. If in addition Et =--- = EN, then u’ (B) = 1 
for B= MB". 
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5.4 Illustrative Examples 


As an illustration, we implement Algorithm 1 in several scenarios for which pair- 
wise optimal maps can be calculated explicitly at every iteration, allowing for fast 
computation without error propagation. In each case, we give some theory first, de- 
scribing how the optimal maps are calculated, and then implement Algorithm 1 on 
simulated examples. 


5.4.1 Gaussian Measures 


No example illustrates the use of Algorithm 1 better than the Gaussian case. This is 
so because optimal maps between centred nondegenerate Gaussian measures with 
covariances A and B have the explicit form (see Sect. 1.6.3) 


t(x) =A V/A? pal PA x, xe RY, 


with the obvious slight abuse of notation. In contrast, the Fréchet mean of a collec- 
tion of Gaussian measures (one of which nonsingular) does not admit a closed-form 
formula and is only known to be a Gaussian measure whose covariance matrix I” is 
the unique invertible root of the matrix equation 


N 
pad y gp (5.4) 


N i=l 


where S; is the covariance matrix of u’. 

Given the formula for A application of Algorithm 1 to Gaussian measures is 
straightforward. The next result shows that, in the Gaussian case, the iterates must 
converge to the unique Fréchet mean, and that (5.4) can be derived from the charac- 
terisation of Karcher means. 


Theorem 5.4.1 (Convergence in Gaussian Case) Let u',...,u be Gaussian mea- 
sures with zero means and covariance matrices S; with Sı nonsingular, and let the 
initial point % be NV (0,IQ) with Ig nonsingular. Then the sequence of iterates gen- 
erated by Algorithm 1 converges to the unique Fréchet mean of (u',...,L). 


Proof. Since the optimal maps are linear, so is their mean and therefore y% is a 
Gaussian measure for all k, say “~ (0, j,) with T} nonsingular. Any limit point of y 
is a Karcher mean by Theorem 5.3.1. If we knew that y itself were Gaussian, then it 
actually must be the Fréchet mean because N~! ad equals the identity everywhere 
on Rf (see the discussion before Theorem 3.1.15). 
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Let us show that every limit point y is indeed Gaussian. It suffices to prove that 
(Ij) is a bounded sequence, because if T} + I’, then VW (0, I.) > -(0,T) weakly, 
as can be seen from either Lehmann-Scheffé’s theorem (the densities converge) or 
Lévy’s continuity theorem (the characteristic functions converge). 

To see that (Ij,) is bounded, observe first that for any centred (Gaussian or not) 
measure u with covariance matrix S, 


W3 (u, ĉo) = tr[S], 


where 6p is a Dirac mass at the origin. (This follows from the spectral decomposition 
of S.) Therefore 
0 < wll] = W2 (%. ĉo) 


is bounded uniformly, because {y} stays in a Wasserstein-compact set by Lemma 
5.3.5. If we define C = sup, tr[I;] < œ, then all the diagonal elements of T} are 
bounded uniformly. When A is symmetric and positive semidefinite, 2|A; jil <A + 
Aij. Consequently, all the entries of T% are bounded uniformly by C, which means 
that (T}) is a bounded sequence. 


From the formula for the optimal maps, we see that if I is the covariance of the 
Fréchet mean, then 


C1 afpera 
>] [r ST | r 


and we recover the fixed point equation (5.4). 

If the means are nonzero, then the optimal maps are affine and the same result 
applies; the Fréchet mean is still a Gaussian measure with covariance matrix I” and 
mean that equals the average of the means of u’, i= 1,...,N. 

Figure 5.1 shows density plots of N = 4 centred Gaussian measures on R? 
with covariances S; ~ Wishart(/,2), and Fig. 5.2 shows the density of the result- 
ing Fréchet mean. In this particular example, the algorithm needed 11 iterations 
starting from the identity matrix. The corresponding optimal maps are displayed 
in Fig.5.3. It is apparent from the figure that these maps are linear, and after a 
more careful reflection one can be convinced that their average is the identity. The 
four plots in the figure are remarkably different, in accordance with the measures 
themselves having widely varying condition numbers and orientations; > and more 
so uł are very concentrated, so the optimal maps “sweep” the mass towards zero. 
In contrast, the optimal maps to u! and u? spread the mass out away from the 
origin. 
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Fig. 5.1: Density plot of four Gaussian measures in R*. 


Fig. 5.2: Density plot of the Fréchet mean of the measures in Fig. 5.1 
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Fig. 5.3: Gaussian example: vector fields depicting the optimal maps x > th (x) 


from the Fréchet mean fi of Fig. 5.2 to the four measures {u} of Fig. 5.1. The order 
corresponds to that of Fig. 5.1 


5.4.2 Compatible Measures 


We next discuss the behaviour of the algorithm when the measures are compatible. 
Recall that a collection @ C #( 2%’) is compatible if for all y,p, U E€ ©, ti o t = ty 
in L2(y) (Definition 2.3.1). Boissard et al. [28] showed that when this condition 
holds, the Fréchet mean of (u!,...,) can be found by simple computations in- 
volving the iterated barycentre. We again denote by % the initial point of Algo- 
rithm 1, which can be any absolutely continuous measure. 


Lemma 5.4.2 (Compatibility and Convergence) /f % U { u'} is compatible, then 
Algorithm 1 converges to the Fréchet mean of (u') after a single step. 


Proof. By definition, the next iterate is 
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1 5 ui 
SN ty #Y, 
Ni 
which is the Fréchet mean by Theorem 3.1.9. 


In this case, Algorithm | requires the calculation of N pairwise optimal maps, and 
this can be reduced to N — 1 if the initial point is chosen to be u!. This is the same 
computational complexity as the calculation of the iterated barycentre proposed in 
[28]. 

When the measures have a common copula, finding the optimal maps reduces to 
finding the optimal maps between the one-dimensional marginals (see Lemma 2.3.3) 
and this can be done using quantile functions as described in Sect. 1.5. The marginal 
Fréchet means are then plugged into the common copula to yield the joint Fréchet 
mean. We next illustrate Algorithm | in three such scenarios. 


5.4.2.1 The One-Dimensional Case 


When the measures are supported on the real line, there is no need to use the al- 
gorithm since the Fréchet mean admits a closed-form expression in terms of quan- 
tile functions (see Sect. 3.1.4). We nevertheless discuss this case briefly because we 
build upon this construction in subsequent examples. Given that th = F;! o Fu, we 
may apply Algorithm 1 starting from one of these measures (or any other measure). 
Figure 5.4 plots N = 4 univariate densities and the Fréchet mean yielded by the 
algorithm in two different scenarios. At the left, the densities were generated as 


i o1 x—mi 1 x—mi, 
Fs) = 50 (Set) E) 65) 


with @ the standard normal density, and the parameters generated independently as 


mi ~ U[-13,—3], mi ~U[3,13], 01,05 ~ Gamma(4,4). 
At the right of Fig. 5.4, we used a mixture of a shifted gamma and a Gaussian: 


3 B 


; ; ay 2 j 
PQ) = gpa ET e A + = m), (5.6) 


with l l 
P' ~ Gamma(4,1), m3 ~U[1,4], m4 ~ U[-4,—I]. 


The resulting Fréchet mean density for both settings is shown in thick light blue, 
and can be seen to capture the bimodal nature of the data. Even though the Fréchet 
mean of Gaussian mixtures is not a Gaussian mixture itself, it is approximately so, 
provided that the peaks are separated enough. Figure 5.5 shows the optimal maps 
pushing the Fréchet mean ji to the measures u!,..., u^ in each case. If one ignores 
the “middle part” of the x axis, the maps appear (approximately) affine for small 
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values of x and for large values of x, indicating how the peaks are shifted. In the 
middle region, the maps need to “bridge the gap” between the different slopes and 
intercepts of these affine maps. 
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Fig. 5.4: Densities of a bimodal Gaussian mixture (left) and a mixture of a Gaussian 
with a gamma (right), with the Fréchet mean density in light blue 


AOUN- 


=107 -=5 0 5 10 15 -6 -4 -2 0 2 4 6 


Fig. 5.5: Optimal maps t4 ` from the Fréchet mean fi to the four measures {u} in 
Fig. 5.4. The left plot corresponds to the bimodal Gaussian mixture, and the right 
plot to the Gaussian/gamma mixture 


5.4.2.2 Independence 
We next take measures u’ on R?, having independent marginal densities fi as in 


(5.5), and fi as in (5.6). Figure 5.6 shows the density plot of N = 4 such measures, 
constructed as the product of the measures from Fig. 5.4. One can distinguish the 
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independence by the “parallel” structure of the figures: for every pair (y,,y2), the 
ratio g(x,y1)/g(x,y2) does not depend on x (and vice versa, interchanging x and 
y). Figure 5.7 plots the density of the resulting Fréchet mean. We observe that the 
Fréchet mean captures the four peaks and their location. Furthermore, the parallel 
nature of the figure is preserved in the Fréchet mean. Indeed, by Lemma 3.1.11 the 
Fréchet mean is a product measure. The optimal maps, in Fig. 5.10, are the same as 
in the next example, and will be discussed there. 


Fig. 5.6: Density plots of the four product measures of the measures in Fig. 5.4 


Fig. 5.7: Density plot of the Fréchet mean of the measures in Fig. 5.6 
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5.4.2.3 Common Copula 


Let u’ be a measure on RÊ? with density 


gy) = CFO) FO) AOA), 


where fi and F are random densities on the real line with distribution functions F} 
and Fy, and c is a copula density. Figure 5.8 shows the density plot of N = 4 such 
measures, with fi generated as in (5.5), fi as in (5.6), and c is the Frank(—8) copula 
density, while Fig. 5.9 plots the density of the Fréchet mean obtained. (For ease of 
comparison we use the same realisations of the densities that appear in Fig. 5.4.) The 
Fréchet mean can be seen to preserve the shape of the density, having four clearly 
distinguished peaks. Figure 5.10, depicting the resulting optimal maps, allows for a 
clearer interpretation: for instance, the leftmost plot (in black) shows more clearly 
that the map splits the mass around x = —2 to a much wider interval; and conversely 
a very large amount mass is sent to x ~ 2. This rather extreme behaviour matches 
the peak of the density of u! located at x = 2. 


Fig. 5.8: Density plots of four measures in R? with Frank copula of parameter —8 
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Fig. 5.10: Frank copula example: vector fields of the optimal maps i from the 


Fréchet mean ji of Fig. 5.9 to the four measures {,1'} of Fig. 5.8. The colours match 
those of Fig. 5.4 
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5.4.3 Partially Gaussian Trivariate Measures 


We now apply Algorithm | in a situation that entangles two of the previous settings. 
Let U be a fixed 3 x 3 real orthogonal matrix with columns U1, U2, U3 and let u’ 
have density 


inya) =s") = U9) eee 
? 2 = y == - CX - ? 
8 \V1592, 93) = 8 3Y 2n dets P 7 


with f’ bounded density on the real line and SÍ € R?*? positive definite. We sim- 
ulated N = 4 such densities with f’ as in (5.5) and S? ~ Wishart(/,,2). We apply 
Algorithm 1 to this collection of measures and find their Fréchet mean (see the end 
of this subsection for precise details on how the optimal maps were calculated). 
Figure 5.11 shows level set of the resulting densities for some specific values. The 
bimodal nature of f implies that for most values of a, {x : f(x) = a} has four el- 
ements. Hence, the level sets in the figures are unions of four separate parts, with 
each peak of f! contributing two parts that form together the boundary of an ellip- 
soid in R? (see Fig. 5.12). The principal axes of these ellipsoids and their position in 
R? differ between the measures, but the Fréchet mean can be viewed as an average 
of those in some sense. 

In terms of orientation (principal axes) of the ellipsoids, the Fréchet mean is most 
similar to u! and u?, whose orientations are similar to one another. 


6-4-2 02 4 —6-4-20 2 46 -10 - 0 5 


4-2 0 2 4 -6-4-20246 


Fig. 5.11: The set {v € R? : g'(v) = 0.0003} for i = 1 (black), the Fréchet mean 
(ight blue), i = 2,3,4 in red, green, and dark blue, respectively 
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Let us now see how the optimal maps are calculated. If Y = (y1 ,y2,y3) ~ u’, then 
the random vector (x1 ,x2,x3) =X =U7'Y has joint density 


l aD E] 1 
fi(xs)exp : NES 


so the probability law of X is p'@ v' with p! centred Gaussian with covariance 
matrix X’ and v’ having density f! on R. By Lemma 3.1.11, the Fréchet mean of 
(U—'#u') is the product measure of that of (p') and that of (v'); by Lemma 3.1.12, 
the Fréchet mean of (1') is therefore 


N aan 
UNOD, f=, F= EE a), AG)= f fois, 
i=1 TOR 


where X is the Fréchet-Wasserstein mean of 2),..., 2. 


-6-4-2 0246 


Fig. 5.12: The set {v € R? : g'(v) = 0.0003} for i = 3 (left) and i = 4 (right), with 
each of the four different inverses of the bimodal density f! corresponding to a 
colour 


Starting at an initial point y = U#(.V (0,20) 8 vo), with Vo having continuous 
distribution Fw, the optimal maps are U o t} o U7! = V (pġ o UT!) with 


ty) (x1,x2) ) 
Fy! o Fy (x3) 


to(x1,x2,x3) = ( 
the gradients of the convex function 
i xi (x1 *3 oy 
gleaaa) = (eaa) (7!) + f EP Fin l))ds, 
where we identify A with the positive definite matrix (Z")!/2[(2!)!/2Eq(z")!/2]-!/2 
(Z5) !/2 that pushes forward ⁄ (0, Z0) to M (0,2"). Due to the one-dimensionality, 


the algorithm finds the third component of the rotated measures after one step, but 
the convergence of the Gaussian component requires further iterations. 
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5.5 Population Version of Algorithm 1 


Let A € W%(R“) be a random measure with finite Fréchet functional. The population 
version of (5.1) is 


q=P(A absolutely continuous with density bounded by M) >0 for some M < %, 
(5.7) 
which we assume henceforth. This condition is satisfied if and only if 


P(A absolutely continuous with bounded density) > 0. 
These probabilities are well-defined because the set 
W(R1;M) = {u € W(R®) : u absolutely continuous with density bounded by M} 
is wey closed (see the paragraph before Proposition 5.3.6), hence a Borel set of 


In light of Theorem 3.2.13, we can define a population version of Algorithm 1 
with the iteration function 


A(Y)= cty y € Ws(R*) absolutely continuous. 
The (Bochner) expectation is well-defined in -⁄ (y) because the random map t3 is 
measurable (Lemma 2.4.6). Since -⁄ (y) is a Hilbert space, the law of large num- 
bers applies there, and results for the empirical version carry over to the population 
version by means of approximations. In particular: 


Lemma 5.5.1 Any descent iterate y has density bounded by q~4M, where q and M 
are as in (5.7). 


Proof. The result is true in the empirical case, by Proposition 5.3.6. The key point 
(observed by Pass [102, Subsection 3.3]) is that the number of measures does not 
appear in the bound g~@M. 

Let A,... be a sample from A and let qn be the proportion of measures in 
(A1,..-,An) that have density bounded by M. Then both n`! Xia iy > Et; and 
qn — q almost surely by the law of large numbers. Pick any œ in the probability 
space for which this happens and notice that (invoking Lemma 2.4.5) 


ol ae ont at | hee cae 
AY) = E$ Fel #y= lim FE EH #y. 


Let A, denote the measure in the last limit. By Proposition 5.3.6, its density is 
bounded by q, 4M — q~¢M almost surely, so for any C > q~¢M and n large, A, has 
density bounded by C. By the portmanteau Lemma 1.7.1, so does lim A, = | th Jit. 


Now let C N g-@M. 
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Though it follows that every Karcher mean of A has a bounded density, we cannot 
yet conclude that the same bound holds for the Fréchet mean, because we need an a- 
priori knowledge that the latter is absolutely continuous. This again can be achieved 
by approximations: 


Theorem 5.5.2 (Bounded Density for Population Fréchet Mean) Let A € %(R“) 
be a random measure with finite Fréchet functional. If A has a bounded density 
with positive probability, then the Fréchet mean of A is absolutely continuous with 
a bounded density. 


Proof. Let g and M be as in (5.7), Aj,... be a sample from A, and qn the proportion 
of (A;)i<n with density bounded by M. The empirical Fréchet mean À, of the sample 
(Aj,...,An) has a density bounded by q; M. The Fréchet mean A of A is unique by 
Proposition 3.2.7, and consequently A, — A in #4 (R?) by the law of large numbers 
(Corollary 3.2.10). For any C > limsupg, 4M, the density of A is bounded by C 
by the portmanteau Lemma 1.7.1, and the limsup is g~7M almost surely. Thus, the 
density is bounded by g~¢M. 


In the same way, one shows the population version of Theorem 3.1.9: 


Theorem 5.5.3 (Fréchet Mean of Compatible Measures) Let A : (Q, F,P) > 
W(IR¢) be a random measure with finite Fréchet functional, and suppose that with 
positive probability A is absolutely continuous and has bounded density. If the col- 
lection {y} UA (Q) is compatible and y is absolutely continuous, then | ot? jy is 
the Fréchet mean of A. 


It is of course sufficient that {y} UA (Q \ M) be compatible for some null set 
NCQ. 


5.6 Bibliographical Notes 


The algorithm outlined in this chapter was suggested independently in this steepest 
descent form by Zemel and Panaretos [134] and in the form a fixed point equation it- 
eration by Alvarez-Esteban et al. [9]. These two papers provide different alternative 
proofs of Theorem 5.3.1. The exposition here is based on [134]. Although longer and 
more technical than the one in [9], the formalism in [134] is amenable to directly 
treating the optimal maps (Theorem 5.3.3) and the multicouplings (Corollary 5.3.4). 
On the flip side, it is noteworthy that the proof of the Gaussian case (Theorem 5.4.1) 
given in [9] is more explicit and quantitative; for instance, it shows the additional 
property that the traces of the matrix iterates are monotonically increasing. 

Developing numerical schemes for computing Fréchet means in W(IR“) is a very 
active area of research, and readers are referred to the recent monograph of Peyré 
and Cuturi [103, Section 9.2] for a survey. 
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In recent work, Backhoff-Varaguas et al. [15] propose a stochastic steepest de- 
scent for finding Karcher means of a population Fréchet functional associated with 
arandom measure A. At iterate j, one replaces y; by 


[rity + (1 —4))i ty, byw A. 


The analogue of Theorem 5.3.1 holds under further conditions. 
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